Dask is built and maintained by hundreds of people collaborating from around the world. In this new series, we talk to Dask contributors, share their journey into programming and open source development, and hear their thoughts on the PyData ecosystem of tools.
We’re super excited to start this series with Genevieve Buckley, a Dask life science fellow, scientist, and programmer from Australia. Genevieve is currently working to improve Dask, specifically for life sciences. She has a background in physics and image analysis, and is very interested in deep learning, automated analysis, and open source.
Q: How did you get started with programming?
I studied physics at university, and back then computer programming wasn't a part of the curriculum the way it is these days. So I didn't know anything about computer science. I didn't know what a for loop was, and I didn't know how to open my terminal (or why I would want to do that), nothing. But I applied to do undergraduate research work with a lecturer of mine, and she taught me to code for that project. We used IDL to compare computer simulations of diffraction patterns to experimental data.
Q: Why is open source important to you?
I've spent my career working in science, where we're often trying to do things that are a little bit weird, or strange, or something that no one has tried to do before. That means there often aren't established software tools. Open source code means having something you can tinker with, so you can change this thing that *almost* fits into something you can use. So that's a huge reason open source code is important to me.
Finally, I’ve noticed that open source tools have many more features than their closed source counterparts - I think often this is because of the huge number of person-hours that go into those projects.
Q: What open source projects do you contribute to?
Q: How did you first get introduced to Dask?
I co-organised a summer school teaching Python to scientists in 2019, and my colleague arranged for John Kirkham to come and teach the lectures on parallel Python. I wasn't familiar with Dask before that, but I was very interested in learning more about parallel programming and scalable computing - I had data problems I needed to solve more efficiently.
Q: How did you start contributing to Dask?
One of my colleagues, Juan, convinced John to come and teach by promising him a weekend to collaborate and hack on projects. I promised to join this too.
My very first contributions to dask-image are from that hack day - I wrote some how-to guides for dask-image. After that, I also started making some very small contributions to the main Dask repository, because some of the features I wanted for new things in dask-image weren't available.
Q: What part of Dask do you mainly contribute to?
Because I'm interested in image analysis for scientific data, most of my contributions focus on the array sub-package of Dask and the specialist dask-image project. The non-code contributions I make tend to be blog posts about applying Dask to specific problems and running tutorials on napari and Dask.
Q: Besides developing Dask, do you use it too? If so, how does one affect the other?
I develop for Dask more than I actually use it, which is not what you might typically expect. A large part of the reason is that I get paid to work full time improving Dask and related projects for life sciences. So I collaborate a lot with other groups, and it's the things people can't do well that drive the priorities for where I spend time as a developer.
Q: What is your favorite part of Dask?
Because I work so much with imaging and array data, my favourite part of Dask to use is the map_overlap function. If you work on image data, you can get most of what you need done just with that one function.
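For readers who have not used it, here is a minimal sketch of the kind of thing map_overlap is for. It is an illustration only, not code from Genevieve's work: the helper name local_mean_3x3, the 8x8 random image, and the 4x4 chunk size are all assumptions chosen for the example. Each chunk is handed to the function together with a one-pixel border borrowed from its neighbours, and that border is trimmed from the result afterwards.

```python
import numpy as np
import dask.array as da


def local_mean_3x3(block):
    """Mean of each pixel and its 8 neighbours (hypothetical helper).

    np.roll wraps around at the block edges, so the outermost ring of the
    output is not meaningful. That is fine here because map_overlap trims
    that ring (depth=1) before stitching the chunks back together.
    """
    total = np.zeros(block.shape, dtype=float)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            total += np.roll(block, shift=(dy, dx), axis=(0, 1))
    return total / 9.0


# A small made-up "image", split into four 4x4 chunks.
rng = np.random.default_rng(0)
image = rng.random((8, 8))
darr = da.from_array(image, chunks=(4, 4))

# depth=1: share a one-pixel border between neighbouring chunks.
# boundary="nearest": pad the outer edge of the image with the nearest value.
smoothed = darr.map_overlap(
    local_mean_3x3, depth=1, boundary="nearest", dtype=float
)
result = smoothed.compute()

# The same computation on the whole image at once, for comparison:
# pad by one pixel, apply the function, then trim the padding.
padded = np.pad(image, 1, mode="edge")
expected = local_mean_3x3(padded)[1:-1, 1:-1]

print(np.allclose(result, expected))
```

The comparison at the end is the point of the sketch: the chunked result should match what you would get by filtering the whole image in one go, which is what makes the function convenient for neighbourhood operations on images too large to process in a single piece.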
My other favourite part of Dask is the dashboard. It's incredibly helpful to see what's happening while your computation runs, and it's fantastic for building good intuition about how your computation behaves. If you're new to the dashboard, this 20-minute video is a great overview: Dask Dashboard walkthrough. There's also a JupyterLab extension; you can see a six-minute video on that here: Dask JupyterLab Extension.
Q: What are some things that you want to see improved in Dask?
As much as I love it, I wish map_overlap were faster. For several of our scientific users, its performance is the difference between being able to use Dask and Dask not being a feasible option at all. It'd make a very big impact if we could improve this.
I spent quite a bit of time on this problem recently, and achieved some small improvements, although nothing earth-shattering. To really improve the speed, it's likely a much more drastic approach is needed. Matt [Rocklin] is trying out some new ideas elsewhere in Dask, and if that works well, it's possible we might be able to try a similar approach for this problem too. It's too soon to tell, but fingers crossed!
Thank you, Genevieve, for sharing your journey with us, and for all your amazing contributions to scientific Python!