The Unbearable Challenges of Data Science At Scale

Hugo Bowne-Anderson
July 13, 2020

Scaling Data Science is a Team Sport

An increasing number of organizations need to scale data science to larger datasets and larger models. However, deploying distributed data science frameworks in secure enterprise environments can be surprisingly challenging because we need to simultaneously satisfy multiple sets of stakeholders within the organization: data scientists, IT, and management.

Solving simultaneously for all sides of this problem is a cultural and political challenge as much as a technical one. This is the problem that we’re passionate about solving at Coiled, and that we recently spoke about in our PyCon 2020 talk.

Here at Coiled, we have been speaking with data scientists, management, IT, and open source developers (among others) about the challenges of scaling data science to both the cloud and on-premise clusters.


To open up this necessary conversation, we’ve published posts detailing the challenges encountered by data scientists, IT, and team leads. The intention of this post is to list all the challenges together and allow you, the reader, to dive into the posts that interest you most (hint: all of them).

The Pain Points of Scaling Data Science

We often see the pain points felt by data scientists boil down to three main challenges (for more depth, read our detailed post here):

  1. Software: Do my machines all have the same software installed? Can I upgrade packages easily?
  2. Resource sharing: Can I share these same machines with my team? How quickly can I get 100 machines, even if only for a few minutes?
  3. Data Access: Where is my data? Can my machines access it too?

We often see the pain points felt by team leads and management boil down to three main challenges (for more depth, read our detailed post here):

  1. Avoid Costs: What stops a novice from leaving 100 GPUs idle?
  2. Track and Optimize: Where are we spending money and how can we reduce this?
  3. Enable Collaboration: How do we replicate the experience of our top performers, and enable them to raise the output of the entire team?

We often see the pain points felt by IT departments boil down to three main challenges (for more depth, read our detailed post here):

  1. Predictability: Can I ensure that critical production workloads will continue unimpeded?
  2. Security: Is our sensitive data appropriately protected from external or internal threats?
  3. Observability: Can we see what’s going on?

Final thoughts

These are the types of data science problems we’re building products to solve here at Coiled. We’re excited to be building products for scaling data science in Python to larger datasets and larger models, particularly for data scientists and teams that want a seamless transition from working with small data to big data. If the challenges we’ve outlined resonate with you, we’d love for you to get in touch with us to discuss our product development.

Level up your Dask using Coiled

Coiled makes it easy to scale Dask maturely in the cloud