We recently chatted with Andy Müller, core developer of scikit-learn and Principal Research Software Development Engineer at Microsoft. Andy is one of the most influential minds in data science with a CV to match. He shares his thoughts on distributed machine learning with open-source tools like Dask-ML as well as proprietary tools from the big cloud providers.
In a past post, we covered distributed ML use cases and discussed whether or not we really need distributed machine learning. You can check it out here.
This interview was lightly edited for clarity. Many thanks to David Venturi for his editorial help on this post.
If the model gets better the more data I use, I probably should be using more data. And if I can't use all my data on a single machine, I should probably use distributed machine learning.
HBA: How could someone figure out if they need to do distributed machine learning or if they just need to be smarter about the way they're doing machine learning?
AM: Rapid prototyping is very important. I always try to subsample my data first and do some plotting and some exploration. There's a library I develop called dabl that makes this work in one line of code; the point of the project is to enable people to prototype quickly. With distributed systems, the tooling will nearly always be harder because you just have more stuff to deal with.
And so I would recommend that people start by doing something small and easy first. Then you can add more data and ask, “Does my model get better?” Use 1% of the data. Then 2%. Then 5%. If the model gets better the more data I use, I probably should be using more data. And if I can't use all my data on a single machine, I should probably use distributed machine learning.
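Andy's subsample-and-grow loop can be sketched with scikit-learn's `learning_curve` helper, which fits the same model on increasing fractions of the data. The synthetic dataset and the exact fractions here are illustrative, not something from the interview:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for "my data"; swap in your own X, y.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# Train on 1%, then 2%, then 5% of the data, and so on.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1_000), X, y,
    train_sizes=[0.01, 0.02, 0.05, 0.1, 0.25, 0.5, 1.0],
    cv=5,
)

# If validation accuracy is still climbing at 100%, more data
# (and maybe a distributed setup) would likely help.
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>5} samples -> CV accuracy {score:.3f}")
```

If the curve has flattened well before the full dataset, that's a hint you don't need distributed training at all.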
HBA: And you do that by plotting learning curves or something?
AM: Basically. I always start with that small thing, like running a logistic regression on a small fraction of my data locally, and then just see what happens. And then I look at how having more data influences the results, how using more complex algorithms influences the results, etc. What are the important features? If your data is big because you have a million features, but it turns out only one of them is important, then you can just keep that one feature and your data will be much smaller.
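Andy's "million features, one of them is important" point can be illustrated with a simple univariate filter. `SelectKBest` is one easy choice for this, not something Andy prescribes, and the dataset here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Wide data where only a handful of features carry signal.
X, y = make_classification(n_samples=1_000, n_features=500,
                           n_informative=5, random_state=0)

# Score each feature against the target and keep the top 5.
selector = SelectKBest(f_classif, k=5).fit(X, y)
X_small = selector.transform(X)
print(X.shape, "->", X_small.shape)  # 500 columns shrink to 5
```

After a reduction like this, data that looked "big" may fit comfortably on a single machine.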
It’s really always all about iterating quickly. Distributed machine learning is great once you figure out that you need a lot of features or a lot of samples. Or maybe later in the process you find that the model works well, but you wish you could tune the hyperparameters much better. Distributed is great for that. But I probably wouldn't do it as the first thing.
HBA: We need to get out of the mindset that big data is the answer and more data is better.
AM: Yeah, it needs to be the right data. I mean, more of the right data is almost always better, but you don't know if you have the right data, you know?
HBA: Right. It's about finding a signal there.
To find the right tools, ask yourself: Where does my data sit? What tools are my team already familiar with? Do those packages have what I need?
HBA: Can you tell me a bit about the different tools for distributed ML? Both open-source and proprietary. And how people can figure out which ones to use.
AM: It really depends on where your data sits. If your data is big, it already lives somewhere, probably in a cluster. And if you have software that orchestrates that cluster, you probably want to use something that works easily with your existing infrastructure. If your data sits in a Spark cluster, then Spark ML is maybe the obvious first choice. If your data sits in Azure, you might want to start with AzureML; if your infrastructure uses Amazon services, maybe you want to use SageMaker; and so on.
The choice also depends on whether the package has the thing that you need. For example, I don't think Spark ML does any deep learning.
And then the other thing is you want to use the tools that your team is familiar with. If everything sits in a Spark cluster and you write a super great Scala implementation of a fancy machine learning algorithm, but no one else on your team can read it, then it's useless, right? That's one of the benefits of Dask: if it looks like pandas, then probably a lot of people can read it. You should use the things that work well with both your data and infrastructure and with the institutional knowledge you have.
HBA: That's fantastic. And, in fact, thank you for mentioning Dask because now, if you want to do distributed ML with Dask, you can get up and running easily with Coiled Cloud!
Success would look like: The big players coming together to empower abstraction layers that allow for running ML algorithms natively on different compute engines.
HBA: So what does the future of distributed machine learning look like to you?
AM: It would be great not to have to reinvent the wheel for each platform. Ideally, we want people not to move data between systems every time they want to do something; we want to integrate into their existing workflow. This matters particularly for distributed ML, where you're interacting with your huge user database, both for training and for prediction. So you want this integration, but you don't want to rewrite all the algorithms for each of the systems, right? It would be great if we could have abstraction layers that allow us to run machine learning algorithms relatively natively on the different platforms and compute engines.
One of the downsides of Spark ML, for example, is that I don't think they've put a huge effort into developing all their algorithms, which is a shame. But it's also a shame that you have to rewrite everything from scratch to run on Spark. It would be better if we could be more flexible. There are so many backends that people are using, and new ones keep appearing. And all of the big cloud companies have their own database solutions that you also want to integrate with. But do you want to only use the machine learning that they provide with their proprietary tools and their integrated databases? I don't know.
It would be great to have machine learning tools that work relatively natively on whatever you have in your Google BigQuery. But then if you decide to move over to Amazon, or to Spark, or to whatever database, you don't have to completely change your whole toolkit.
HBA: So some form of distributed ML portability. This reminds me of the way that GDPR frames data portability. Another thing that resonated with me there was the idea of having some standards that we all just agree upon. This is something Fernando Pérez has discussed a lot with respect to the Jupyter ecosystem, and something Wes McKinney thinks about with Arrow. Having a substrate that we can all build on top of!
AM: Yeah, exactly. I think that would be great. For the distributed setting, it's going to be much harder than for Arrow, and even Arrow hasn't found that much adoption yet, though I hope and believe we'll see more soon. So that's why the future is going to be tricky. Arrow as a data exchange format might be part of the solution, though. And it also requires the big players to, you know, come together.
HBA: Absolutely. We'll see what happens there. Andy, this has been super fun. I always enjoy our conversations. Have a great evening.
AM: Yeah, you too. Take care.
We hope you enjoyed this second installment of our conversation with Andy.
For more distributed machine learning and data science, check out our hosted notebook examples on Coiled Cloud by signing up below.