An increasing number of organizations need to scale data science to larger datasets and larger models. However, deploying distributed data science frameworks in secure enterprise environments can be surprisingly challenging because we need to simultaneously satisfy multiple sets of stakeholders within the organization: data scientists, IT, and management.
Solving simultaneously for all sides of this problem is a cultural and political challenge as much as a technical one. This is the problem that we’re passionate about solving at Coiled, and that we recently spoke about in our PyCon 2020 talk.
In this post, we’ll discuss the pain points felt by data team leads and management when trying to deploy data processing technologies to provide data scientists with distributed computing. In other posts, we do the same for data scientists and for IT professionals.
We often see the pain points felt by team leads and management reduce to three main challenges: controlling costs, making data-driven decisions about the practice of data science itself, and enabling collaboration across skill levels. We'll call out these challenges in each of the sections below.
Be careful what you wish for with “infinite scaling” on the cloud.
Fundamentally, we're transforming the organization by giving as much computing power as possible to every data scientist. To use a military analogy, this is like giving a tank or fighter jet to every person in the military. This is amazingly powerful, but may result in a surprisingly expensive fuel bill.
Most distributed computing costs are avoidable. Common culprits include clusters left running idle, over-provisioned workers, and inefficient code that burns far more compute than it needs to.
These issues are straightforward to address, but they do need to be addressed. We can implement per-user and per-group usage limits, and expose per-user and per-group usage metrics to management. We can also perform distributed profiling and see how much money we’re spending on every line of code that we’re running. This lets us make better management decisions to reduce costs.
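As a minimal sketch of the per-user and per-group limits described above (all names, limits, and costs here are hypothetical, for illustration only — this is not Coiled's implementation):

```python
from collections import defaultdict


class UsageTracker:
    """Toy per-user and per-group spend accounting with usage limits.

    All names and dollar figures are hypothetical illustrations."""

    def __init__(self, user_limits, group_limits, user_to_group):
        self.user_limits = user_limits        # e.g. {"alice": 100.0}
        self.group_limits = group_limits      # e.g. {"research": 150.0}
        self.user_to_group = user_to_group    # e.g. {"alice": "research"}
        self.user_spend = defaultdict(float)
        self.group_spend = defaultdict(float)

    def record(self, user, cost):
        # Attribute the cost (e.g. dollars of cluster time) to both
        # the user and their group.
        self.user_spend[user] += cost
        self.group_spend[self.user_to_group[user]] += cost

    def can_launch(self, user):
        # Refuse new clusters once a user or their group hits its limit.
        group = self.user_to_group[user]
        return (self.user_spend[user] < self.user_limits[user]
                and self.group_spend[group] < self.group_limits[group])

    def report(self):
        # Per-group spend that management can review.
        return dict(self.group_spend)
```

In practice these metrics would be collected from the cluster scheduler rather than recorded by hand, and surfaced on a dashboard rather than returned from a method, but the accounting structure is the same.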
Slide from our PyCon 2020 talk “Challenges of Deploying Distributed Computing”
The irony is that the practice of data science is not itself terribly data-driven today.
Tracking and profiling aren't just useful for reducing costs; they also help us make data-driven decisions about our processes. For smaller data science teams this is easy: typically there is a team of one to five people with a central team lead who has a firm grip on what everyone is doing. In larger organizations, however, we often find a strong desire to know what is going on, especially in a messy field like data science.
This becomes more challenging when we add scalable data science tools, both because environments and techniques churn quickly, and because tracking and profiling distributed services is hard.
There are great opportunities here though, and we're really excited about enabling questions like: Which teams and workloads drive the most spend? Which computations are worth optimizing? Where do our data scientists lose the most time?
These questions let larger organizations tune and optimize their entire data science division. This is rarely something that anyone has done well yet today, but it is a common topic of conversation.
How do we turn one 10x engineer into ten 10x engineers?
Collaboration is about enabling bottom-up team management.
Distributed computing services are often brought into organizations by highly effective early-adopter data scientists (or at least this is our experience with Dask). These individual contributors invariably share the experience with their colleagues, and end up serving as technical leads and informal devops for a while.
But these early-adopter individual contributors need help to quickly uplevel their teammates. They need to craft software environments for everyone else to use. They need to track what their colleagues are doing to diagnose performance issues. They need to connect to their colleagues' clusters and help debug sticky situations. This can be tricky, especially in today's world of remote work.
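One lightweight way to craft a shared software environment is a pinned conda spec that every teammate installs from. A sketch (the environment name and all version pins here are illustrative, not a recommendation):

```yaml
# environment.yml -- a shared, pinned team environment (names/versions illustrative)
name: team-data-science
channels:
  - conda-forge
dependencies:
  - python=3.8
  - dask=2.20
  - distributed=2.20
  - pandas=1.0
  - jupyterlab
```

Pinning versions means a colleague's cluster workers and notebook run the same library versions as everyone else's, which removes a whole class of hard-to-debug mismatches.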
Collaboration isn’t a single feature. It’s a suite of small features that are designed around this relationship of skill-sharing. Doing it well helps to amplify effective engineers, while also shortening the time it takes for novice data scientists to become experts.
These are the kinds of data science problems we're building products to solve here at Coiled. We're really excited to be building products for scaling data science in Python to larger datasets and larger models, particularly for organizations and data teams that want a seamless transition from working with small data to big data. If the challenges we've outlined resonate with you, we'd love it if you got in touch with us to discuss our product development.