Dask is built and maintained by hundreds of people collaborating from around the world. In this series, we talk to some of these Dask contributors, discuss their journey into open source development, and hear their thoughts on the PyData ecosystem of tools. In this edition, we’re delighted to chat with Julia Signell, a core maintainer of the Dask project and the Head of Open Source at Saturn Cloud.
Starting out as an environmental engineer, Julia got involved in the PyData ecosystem early on. She recalls:
“After I went to my first SciPy conference, I was pretty sure that I was more interested in building open source tools than going to grad school and continuing on the research track.”
Julia joined Anaconda as an intern, which later turned into a full-time job. At Anaconda, she played a crucial role in building the HoloViz project that includes many Python visualization tools like HoloViews, hvPlot, and Datashader. Currently, Julia works at Saturn Cloud, where she contributes to Dask and builds a data science platform. She says:
“I love switching back and forth; spending part of my time working on a library and the other part working on an application where people can easily use that library.”
I started programming in earnest when I was trying to manage and visualize data as part of my work. I was working in an environmental engineering lab helping with research on high-intensity rainstorms in urban environments. One of my favorite parts of that job was making animations of lightning strikes moving across maps. For those maps, I was using xarray and Datashader. Those two libraries are still some of my favorites. I heard Datashader explained as "pixel-level histogramming" and that just made so much sense to me.
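That "pixel-level histogramming" idea can be sketched in plain NumPy. This is a toy stand-in for what Datashader does at scale (with chunking, canvas transforms, and shading on top); the random points below are invented for illustration:

```python
import numpy as np

# 10,000 random (x, y) points standing in for strike locations on a map
rng = np.random.default_rng(0)
xs, ys = rng.normal(size=(2, 10_000))

# "Pixel-level histogramming": bin the points onto a fixed pixel grid,
# so each pixel's value is the count of points that fall inside it.
counts, _, _ = np.histogram2d(xs, ys, bins=(300, 200))

# counts is now a 300x200 array -- effectively the raster image that a
# tool like Datashader would then shade and colorize.
```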
The most important part about open source for me is the community. It's hard to beat the feeling of reading someone's bug report and then being able to fix it for them while interacting directly with the original reporter. Or being able to ping another developer who works at a different company and talk through an approach. The community provides a sense of trust and continuity that I suspect is rare in the broader tech space.
I think I probably first heard about Dask as a backend for xarray objects.
But I don't think I fully understood Dask's scope until I started at Saturn Cloud and really got up to speed on how Dask works.
The first time I really understood why some calculations are hard in distributed computing was when I realized that if you split data into shards, take the mean of each shard, and then take the mean of those results, it's not the same as taking the mean of the whole dataset. That sounds obvious when I say it now, but at the time it was like a lightbulb. I finally realized that to match pandas, Dask has to reimplement every method in a way that works when parallelized and distributed.
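The pitfall shows up as soon as the shards are unequal in size. Here is a minimal NumPy sketch (the toy numbers are made up) of both the naive mean-of-means and the correct approach of combining (sum, count) pairs, which is essentially how a distributed mean has to be implemented:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 10.0, 20.0])  # toy dataset

# Split into unequal shards, as a distributed system might
shards = [data[:3], data[3:]]  # sizes 3 and 2

naive = np.mean([s.mean() for s in shards])  # mean of the shard means
true = data.mean()                           # mean of the full dataset
# naive = (2.0 + 15.0) / 2 = 8.5, but true = 36 / 5 = 7.2

# To get the right answer, each shard must report (sum, count),
# and the combine step divides the total sum by the total count:
total = sum(s.sum() for s in shards)
count = sum(len(s) for s in shards)
correct = total / count  # 7.2, matches data.mean()
```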
I probably did one or two PRs on Dask while I was at Anaconda, but I didn't really get going until I started at Saturn Cloud and part of my job was to do maintenance on Dask. It was really great how the existing maintainers were willing to put time into helping me understand the project.
I mostly work on Dask proper. Right now, I am pretty homed in on better error-handling, ongoing pandas compatibility, and trying to make the docs easier to use. I'd like to find more time to work on dask-geopandas and xarray.
I love building things. Especially things that people use.
This is embarrassing, but I don't use Dask as much as I would like to. I just don't work with data very much anymore. To make up for this, I try to listen really carefully to what people tell me about their experiences and struggles using Dask.
My favorite part is that since Dask implements the NumPy API and the pandas API we don't really ever have to think about naming. Most of the work has to do with how to implement different methods in a parallel way; it's never about what the methods should be or what they should be called. It's wild how much energy it saves to follow an existing API.
My second favorite part is the representation of DataFrames, arrays, and even clients and clusters that you get in Jupyter. That kind of visual output is super valuable for building understanding. If you haven't taken a look recently, it has changed a lot - you should check it out!
Well at the moment I have been pretty focused on dask.dataframe. In particular, I have been thinking about how to implement more of the pandas API and how to be clearer about actions that aren't supported and why.
The API standardization in NEP18 and NEP35 is super impressive to me. The enhancements described by those proposals let you call a NumPy method on a Dask Array and get a Dask Array back or easily create a Dask Array where each of the partitions is a CuPy array rather than a NumPy array. With the proliferation of array types this seems essential to keeping these libraries usable. It would be awesome to see this concept expanded to other areas like visualization and scheduling.
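A small sketch of what those two proposals enable, assuming NumPy >= 1.20 and Dask are installed (CuPy is left out here since it needs a GPU; a Dask Array stands in as the non-NumPy array type):

```python
import numpy as np
import dask.array as da

x = da.ones((1000, 1000), chunks=(250, 250))

# NEP 18 (__array_function__): calling a NumPy function on a Dask Array
# dispatches to Dask's implementation and returns a Dask Array.
y = np.sum(x, axis=0)
assert isinstance(y, da.Array)

# NEP 35 (like=): create a new array whose type matches an existing
# array's type, without naming that type explicitly in your code.
z = np.ones((4,), like=x)
assert isinstance(z, da.Array)
```

With the `like=` pattern, library code can stay array-type-agnostic: the same function works whether it is handed a NumPy, Dask, or CuPy array.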
Thank you, Julia, for all your contributions to the PyData ecosystem. As someone who is especially fond of plotting tools, I truly appreciate your contribution to the PyData visualization space.
Thanks for reading! And if you’re interested in trying out Coiled Cloud, which provides hosted Dask clusters, docker-less managed software, and one-click deployments, you can do so for free today when you click below.