Do we really need distributed machine learning?

Hugo Bowne-Anderson
October 20, 2020

We recently chatted with Andy Müller, core developer of scikit-learn and Principal Research Software Development Engineer at Microsoft. Andy is one of the most influential minds in data science with a CV to match. He shares his thoughts on distributed machine learning with open-source tools like Dask-ML as well as with proprietary tools from the big cloud providers.

In this post we cover the following topics:

  1. Is distributed machine learning actually useful?
  2. Distributed ML use cases: when to use it and when not to?

In an upcoming post, we'll cover tools for distributed machine learning and the future of the field.

This interview was lightly edited for clarity. Many thanks to David Venturi for his editorial help on this post.


Is distributed machine learning actually useful?

When your data doesn't fit on a single machine, you either have to do distributed learning or you have to throw away data. The question then is: what is the payoff of actually using all of your data?

Hugo Bowne-Anderson: Okay, Andy, so this is a slightly provocative question: does anybody really need to do distributed machine learning?

Andy Müller: So, first, I think it's important to specify what you mean by distributed machine learning. Is it training models? Or is it prediction? Because one of those is much easier than the other one.

Distributed prediction is super easy. People do it in production with scikit-learn all the time, because distributed prediction means you just spin up a bunch of boxes that all run the same prediction function. And that's distributed, but it's the boring kind of distributed. It's the one that really works well.
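The "boring kind of distributed" can be sketched in a few lines: train once, then run the same predict function over independent chunks of data in parallel. This is an illustrative sketch (not from the interview) using synthetic data and a local thread pool; in production each chunk would go to its own box, but the structure is identical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train once, on a single machine.
X, y = make_classification(n_samples=10_000, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# "Distributed" prediction: split the scoring data into chunks and
# run the same predict function on every chunk in parallel.
chunks = np.array_split(X, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    parts = list(pool.map(model.predict, chunks))
predictions = np.concatenate(parts)
print(predictions.shape)
```

Because the chunks never need to talk to each other, this scales out trivially, which is exactly why it "really works well."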

HBA: That's something you don't necessarily need Dask for. I mean, you can use Dask-ML for it, but you don’t need to.

AM: Yeah, but you do need to orchestrate your cluster in some way, right?

And then, distributed training. There are two use cases for that:

  • First, your data doesn't fit on the hard drive of a single machine. You could do streaming training, but you still need to stream in from somewhere. Your data needs to live somewhere.
  • And the other use case is just making things faster. And depending on the algorithm, even if your dataset does fit on a single machine, you might be faster by distributing, but that's probably not going to be a linear speed up. A lot of deep learning work is on distributed learning of algorithms.
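The streaming option mentioned in the first bullet is what scikit-learn exposes as `partial_fit`: only one batch has to be in memory at a time, so the full dataset can live elsewhere. A minimal sketch (the in-memory array here stands in for batches streamed from disk or object storage):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=10_000, random_state=0)

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all labels must be declared up front

# Pretend each slice is a batch streaming in from storage; the model
# only ever sees one batch at a time.
for X_batch, y_batch in zip(np.array_split(X, 20), np.array_split(y, 20)):
    model.partial_fit(X_batch, y_batch, classes=classes)

print(round(model.score(X, y), 2))
```

This sidesteps distributed training entirely when the bottleneck is memory rather than compute.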

When your data doesn't fit on a single machine, you either have to do distributed learning or you have to throw away data. The question then is what is the payoff of actually using all of your data?

HBA: And there's another trade off. Why not just buy a bigger machine? There are really big machines out there these days.

AM: Sure, but what if your biggest machine is not enough? So if your data is less than 100 gigabytes, then that definitely fits in a single machine. If you're at several terabytes, then it probably doesn't fit in a single machine, and then you have to use distributed computing.

HBA: Exactly. So let's say you work at one of the big finance houses, you have 100 gigabytes of data that you want to do some machine learning on, and you have access to a cluster. You could get a bigger machine, but you already have the cluster. So maybe there's a trade-off there in terms of time and the efficiency of getting the job done as well, right?

AM: Yeah, but it depends a little bit on the algorithm. It could be faster on the cluster, or it could be slower on the cluster. And it's hard to know beforehand. It would be interesting to look at more recent benchmarks for random forests in scikit-learn versus random forests in Spark.

Last time I checked, which was a couple of years ago, scikit-learn won by a big margin. And so if you say, “It's cheaper to run on the cluster because we already have the cluster, but then we have to pay a data scientist for three more hours because they’re waiting around because the algorithm’s slow on the cluster,” maybe it would have been cheaper to just get the bigger machine.

HBA: I want to dive into that a bit more. Because one trend I've noticed is people are using distributed computing to iterate on their scientific process and data analysis more quickly. It actually changes the pace and the cadence of the scientific method for them.

In the machine learning world, if you can speed up the process, the training time and/or the hyperparameter tuning, it actually changes the way you do science as well. Not just in terms of speed, but in the way you think about it and your own methodology.

AM: Yeah, I guess I've become disillusioned with too much AutoML. But doing hyperparameter tuning is something where you can get linear speed ups from parallelism. That’s really a great application. And that's where I use distributed machine learning most.
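The linear speedup Andy describes comes from the fact that each hyperparameter candidate trains independently. A local sketch using scikit-learn's `GridSearchCV` (Dask-ML provides a drop-in replacement with the same interface that spreads the same fits across a cluster):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# 12 candidate settings x 5 CV folds = 60 fully independent fits.
# n_jobs=-1 spreads them across all local cores; a distributed
# scheduler can spread them across machines the same way.
grid = GridSearchCV(
    SVC(),
    {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.001]},
    cv=5,
    n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_)
```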

There are also some distributed versions of Hyperband. This is a place where, because you're running many different algorithms and training each algorithm is somewhat independent of the other algorithms, you can actually get pretty big speedups from doing distributed computation. 
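Hyperband builds on successive halving: start many configurations on a small budget and promote only the promising ones to larger budgets. Dask-ML ships a distributed `HyperbandSearchCV`; the sketch below uses scikit-learn's (still experimental) successive-halving search to show the same idea locally.

```python
import numpy as np
from sklearn.datasets import make_classification
# Enables the experimental halving searches in scikit-learn.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import HalvingRandomSearchCV

X, y = make_classification(n_samples=2_000, random_state=0)

# Many candidates start on a small sample; only the best survive to
# larger budgets, so most candidates are eliminated cheaply, and every
# fit is independent, so the whole thing parallelizes well.
search = HalvingRandomSearchCV(
    SGDClassifier(random_state=0),
    {"alpha": np.logspace(-5, 1, 20)},
    resource="n_samples",
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```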


Distributed ML use cases

We're getting machines to do what they're good at, and we do what we’re good at. I think people lose sight of that a lot of the time because they want the machines to do as much as possible.

HBA: What are some use cases for distributed machine learning? Maybe some obvious ones and some less obvious ones.

AM: Wow, use cases for machine learning. It's my favorite question.

HBA: *Distributed* machine learning.

AM: Distributed.

HBA: Yeah, so you can't say the Boston Housing Dataset.

AM: Well, I can say ad click prediction. That's the favorite.

HBA: Oh no, that's the most depressing one.

AM: So one of the fun things about ad click prediction is that it's so imbalanced, because no one ever clicks on ads. So you have something like a 1000-to-1 class imbalance. But that also means you can probably throw away a lot of the data and still train a model, because all of the non-clicks aren't that informative. And so if you can go from 1000 to 1 down to 100 to 1, you've already reduced your dataset size by ten times. And so now it fits into RAM again, you know.
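The downsampling Andy is describing is a few lines of NumPy. This sketch uses synthetic click labels (roughly one click per thousand impressions) and keeps every click but only ~100 non-clicks per click:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labels: about 1 click per 1000 impressions.
y = (rng.random(1_000_000) < 0.001).astype(int)

clicks = np.flatnonzero(y == 1)
non_clicks = np.flatnonzero(y == 0)

# Keep all clicks, but subsample the non-clicks down to ~100:1.
kept_non_clicks = rng.choice(non_clicks, size=100 * len(clicks), replace=False)
keep = np.concatenate([clicks, kept_non_clicks])

print(f"kept {len(keep):,} of {len(y):,} rows")
```

In a real pipeline you would reweight or recalibrate the model afterwards, since subsampling changes the base rate the classifier sees.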

HBA: What else?

AM: Recommender systems. It's not something we have in scikit-learn, but it's something that people use a lot in Spark ML where basically you do some matrix factorization and some collaborative filtering to do product recommendations or something like that. It's the classic Netflix challenge thing. Which movies should you watch?
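The matrix factorization Andy mentions can be shown in miniature: learn low-dimensional user and item factors whose dot products reconstruct the observed ratings, then use them to fill in the unrated cells. This toy sketch fits the factors with plain SGD on a hand-made ratings matrix; Spark ML's ALS solves the same objective with alternating least squares, distributed by blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny ratings matrix: 4 users x 5 movies, 0 = unrated.
R = np.array([
    [5, 3, 0, 1, 0],
    [4, 0, 0, 1, 1],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)

k = 2  # number of latent factors
U = rng.normal(scale=0.1, size=(4, k))  # user factors
V = rng.normal(scale=0.1, size=(5, k))  # movie factors

# SGD over the observed (non-zero) entries only.
users, items = np.nonzero(R)
for _ in range(2000):
    for u, i in zip(users, items):
        err = R[u, i] - U[u] @ V[i]
        U[u] += 0.01 * (err * V[i] - 0.02 * U[u])
        V[i] += 0.01 * (err * U[u] - 0.02 * V[i])

# U @ V.T reconstructs known ratings and predicts the unrated cells.
print(np.round(U @ V.T, 1))
```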

Then maybe less classical ML. There are a bunch of areas in medical diagnosis where algorithms are actually pretty good. Distributed computing could be useful with future medical imaging datasets as well. People have worked a lot on tackling different diseases of the eye, and that works really well using convolutional neural networks.

HBA: A famous example of this was diabetic retinopathy, which is one of the leading causes of blindness. Google built a model that supposedly outperformed domain experts.

AM: The thing is, machine learning models in the medical fields have so much less context. So they will usually just solve a very small subtask, like maybe just the image recognition task. But that's not how medicine works, right? The medical task is not just this part; there's the whole interacting with the patient, and the patient history, and the treatment plan aspects. But there are usually these narrow tasks where the machine learning algorithms can be much better. In particular, think about how many retinas a doctor can look at in their lifetime.

And think about how many retinas the deep learning algorithm can look at in one night. And you can see that one of them is several orders of magnitude larger than the other one. And so that makes sense that in this very narrowly defined task, the computer vision algorithm can do much better than a human.

But that doesn't mean we can get rid of humans, but it means we can maybe automate some tasks.

HBA: Exactly. We're getting machines to do what they're good at, and we do what we’re good at. I think people lose sight of that a lot of the time because they want the machines to do as much as possible.

Interested in distributed data science?

We hope you enjoyed the first installment of our conversation with Andy. Keep your eyes peeled for the next installment on tools for distributed ML, coming soon.

For more distributed machine learning and data science, check out our hosted notebook examples on Coiled Cloud by signing up below.

Level up your Dask using Coiled

Coiled makes it easy to scale Dask in the cloud