Gil Forsyth and Mike McCarty joined us recently for a Webinar on Distributed Machine Learning at Capital One. Gil is a Senior Manager of Software Engineering and Mike is a Director of Software Engineering at Capital One. They discussed how Capital One scales machine learning workflows, the challenges of distributed computing in an industry setting, and their learnings from using Dask. You can access the webinar recording by filling out the form below.
In this post, we will cover:
- How data scientists at Capital One scale their machine learning workflows
- Moving to distributed computing: people and infrastructure
- Using Parquet to hand data between Spark and Dask
- Choosing how to scale: CPU-bound vs. memory-bound problems
Capital One has a large community of data scientists who come from diverse backgrounds like physics and statistics. Most of these data scientists use Python as their daily-driver language, but a small subset also use Java, Scala, R, and Spark.
Data scientists at Capital One have a few options for scaling their workflows.
For a long time, teams scaled their workflows by rewriting them. Imagine data scientists training their models in scikit-learn, then handing them over to someone else to reimplement in Spark using Java. This practice has largely gone away, but some companies still follow it, and it carries a lot of unwanted risk. It becomes a source of friction when Spark and the PyData ecosystem do not get along, and sometimes the distributed architecture is so different that the result is no longer the same as the original model. Capital One wanted a solution to robustly scale their Python computations in line with the existing PyData stack; that’s where Dask came in.
Gil says not everyone needs scalable machine learning. There are two ways to look at scalable computing: scaling because your data doesn’t fit in memory, and scaling because your computation takes too long on the CPUs you have.
Memory constraints are comparatively easy to overcome because many cloud instances provide a lot of memory, but CPU constraints require distributed computing, which is challenging but also worth it. The next logical question is: how do you move to distributed computing? It’s a two-part answer: people and infrastructure.
Your people will need to do both machine learning and distributed computing, so a big part of moving to a distributed workflow is education. Capital One held internal tutorial sessions with good turnout, which shows there is a real appetite to learn Dask.
It’s important to train people because the distributed computing paradigm is very different from local computation, and there are very different costs to keep in mind. Learning some under-the-hood details is valuable for predicting and tuning performance. Gil says:
“I call it the 'original sin of Databricks': they made spinning up a Spark cluster super easy!”
There are many ways to deploy Dask. Capital One uses SSH, YARN, Kubernetes, and sometimes even manual deployments. They also use AWS EMR and Dask Cloud Provider. Maintaining all of these is difficult, so services that encapsulate them, like Coiled Cloud, are becoming increasingly popular.
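As a concrete illustration, here is a minimal sketch of one of these deployment options, dask.distributed’s SSHCluster. The hostnames and worker settings are placeholders, not Capital One’s configuration:

```python
from dask.distributed import Client, SSHCluster

# Launch a scheduler on the first host and workers on the rest over SSH.
# Hostnames and worker options here are illustrative placeholders.
cluster = SSHCluster(
    ["scheduler-host", "worker-1", "worker-2"],
    worker_options={"nthreads": 4},
)
client = Client(cluster)

# Other options follow the same cluster-object pattern, e.g. dask_kubernetes'
# KubeCluster, dask_yarn's YarnCluster, or dask_cloudprovider's EC2Cluster.
```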
Gil also shares the story of Capital One’s RAPIDS infrastructure on AWS, where a small number of nodes gave them massive compute!
Parquet is a very powerful file format for distributed computing. Most data at Capital One is tabular, and Parquet is perfect for this scenario; it is very close to a universal serialization format. Gil says this is an interesting place to insert Dask into your workflow: it’s not uncommon to run ETL in Spark and then pick up from there using Dask.
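For example, picking up a Parquet dataset written by a Spark ETL job is usually a one-liner in Dask. The S3 path and column names below are illustrative:

```python
import dask.dataframe as dd

# Read a Parquet dataset that an upstream Spark job wrote out.
df = dd.read_parquet(
    "s3://example-bucket/etl-output/",  # illustrative path
    engine="pyarrow",
)

# Continue the analysis in the PyData stack.
result = df.groupby("account_id")["amount"].sum().compute()
```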
There are some pain points to be aware of, though. Spark and Dask both follow the Parquet spec; the problems lie in the details of how each writes and reads it.
Watch the webinar recording to learn more about how this combination is used by the team!
The first step in scaling is always to understand your problem: is it CPU-bound or memory-bound? This matters because the workflow is different for each case.
Exhaustive grid search over a bunch of folds is a CPU-bound problem. You can use joblib with a Dask backend for this and fit every parameter combination and fold in parallel. On the other hand, training with more data is a memory-bound problem. Here, instead of regular XGBoost, we use distributed XGBoost on Dask.
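A rough sketch of both patterns follows. The dataset, parameters, and paths are illustrative, not the team’s actual setup:

```python
import joblib
import xgboost as xgb
import dask.dataframe as dd
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

client = Client()  # local cluster here; could point at a remote scheduler

# CPU-bound: exhaustive grid search, with every fit dispatched to the
# cluster through joblib's Dask backend.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [5, 10, None]},
    cv=5,
)
with joblib.parallel_backend("dask"):
    search.fit(X, y)

# Memory-bound: train distributed XGBoost on a Dask DataFrame that is
# partitioned across the workers (path and column names are illustrative).
ddf = dd.read_parquet("s3://example-bucket/training-data/")
dtrain = xgb.dask.DaskDMatrix(client, ddf.drop(columns="label"), ddf["label"])
output = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "tree_method": "hist"},
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]
```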
At this point, you might be wondering whether you can do both at the same time. It is possible, but difficult, because the methods above aren’t designed to work together optimally. It can be accomplished, though, by adding a new layer of orchestration with tools like Prefect.
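As an illustration of what that orchestration layer could look like, here is a hypothetical sketch using Prefect’s flow/task decorators (not the team’s actual pipeline):

```python
from prefect import flow, task

@task
def train_fold(fold_id: int) -> float:
    # Hypothetical task: each fold could spin up its own Dask cluster and
    # run a memory-bound distributed training job, returning a score.
    ...
    return 0.0

@flow
def train_all_folds(n_folds: int = 5):
    # Prefect handles scheduling, retries, and dependencies across folds,
    # so the CPU-bound and memory-bound layers stay decoupled.
    return [train_fold.submit(i) for i in range(n_folds)]

if __name__ == "__main__":
    train_all_folds()
```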
Let’s look at another example: adopting the scikit-learn API. Generally, you will work through three paradigms: a single-core workflow on your local machine, then Dask to leverage all of your CPU cores, then RAPIDS to work with GPUs.
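The appeal of adopting the scikit-learn API is that the estimator code barely changes across those paradigms. Here is a minimal sketch, assuming dask-ml is installed and, for the last step, cuML plus a GPU:

```python
import dask.array as da
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 1. Single core: scikit-learn on in-memory NumPy arrays.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
cpu_model = LogisticRegression().fit(X, y)

# 2. All cores / a cluster: dask-ml exposes the same fit/predict API on Dask arrays.
from dask_ml.linear_model import LogisticRegression as DaskLogisticRegression
X_da, y_da = da.from_array(X, chunks=2_000), da.from_array(y, chunks=2_000)
dask_model = DaskLogisticRegression().fit(X_da, y_da)

# 3. GPUs: cuML (RAPIDS) mirrors the same API again (uncomment on a GPU machine).
# from cuml.linear_model import LogisticRegression as CumlLogisticRegression
# gpu_model = CumlLogisticRegression().fit(X, y)
```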
In the webinar recording, Mike covers the key points to consider for each paradigm and introduces practical checks the team uses as guidelines for which data structure to use in a given situation.
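As one illustrative heuristic (not necessarily the exact checks shown in the webinar), you can compare the size of the data against the memory available on a single machine before deciding between pandas and Dask. The helper below is hypothetical and assumes psutil is installed:

```python
import os
import psutil  # assumed available for querying system memory

def suggest_dataframe_library(path: str) -> str:
    """Very rough heuristic: pick a DataFrame library based on data size vs. RAM."""
    data_size = os.path.getsize(path)
    available_ram = psutil.virtual_memory().available

    if data_size < 0.1 * available_ram:
        return "pandas"          # comfortably fits in memory on one machine
    elif data_size < available_ram:
        return "pandas or dask"  # fits, but leaves little headroom for copies
    else:
        return "dask"            # larger than memory: partition across workers

print(suggest_dataframe_library("transactions.parquet"))  # path is illustrative
```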
Mike goes on to discuss different options for scaling with Dask. He also uses Coiled to scale to the cloud and demonstrates some checks to verify which paradigm to use. You can follow along with the supporting notebooks and learn more in their blog post: Custom Machine Learning Estimators at Scale on Dask & RAPIDS!