Scale your data science workflows with Python and Dask

April 23, 2021

Data scientists are increasingly using Python and its ecosystem of tools for their analyses. As datasets grow larger and more complex, this brings the challenge of scaling data science workflows. Dask is a library built for exactly this purpose: making it easy to scale your Python code and serving as a toolbox for distributed computing!

In this post, we will discuss:

  • Why use Python for Data Science?
  • Why is there a need for scalability?
  • How does Dask make scaling simple, and where is it used?

Learn more about Python, Dask, and scalable compute in our comprehensive guide: Scaling Data Science in Python with Dask.

Hugo Bowne-Anderson, the Head of Data Science Evangelism and Marketing at Coiled, introducing the new guide - Scaling Data Science in Python with Dask.

Python for Data Science

Python’s wide adoption is attributed to three key factors:

  1. It has a more readable, human-friendly syntax than many other programming languages.
  2. It has a free and open source foundation, which allows more people to learn, teach, and improve the language.
  3. It has an active and growing community around it. There are numerous supporting tools and libraries that make Python one of the most powerful languages to use, including for data science.

Python has made data science more accessible to everyone, from researchers to analysts to students:

  • Libraries like NumPy, pandas, scikit-learn, XGBoost, Matplotlib, etc. are free and open source alternatives to proprietary data science tools like MATLAB. 
  • Tools like Dask also help reduce reliance on high-cost infrastructure like supercomputers for scalability. 
  • The Jupyter ecosystem makes collaboration easy by providing executable and shareable environments. 
  • New specialized libraries are being developed every day, adding more layers of abstraction and speaking the language of a specific community, such as Astropy for astrophysics, SunPy for solar physics, Biopython for computational biology, and many more.

The scientific Python community is growing rapidly, so there is also a growing pool of resources to support new users!

Scaling Data Science Workflows

Recent advances in technology, algorithms, and computational power have led to a “big data revolution”. To work with these large and complex datasets, data scientists need scalable solutions, and they look toward parallel and distributed computing.

Scalable computing can be anything that extends beyond a single thread on a single core. Parallel computing is the process of computing multiple tasks simultaneously to accelerate workflows, and distributed computing is the process of leveraging computing resources of multiple machines. Learn more about these concepts in Scaling Data Science in Python with Dask!
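
To make these ideas concrete, here is a minimal sketch (not from the guide) using only Python’s standard library. The same CPU-bound work runs serially on one core, then in parallel across the cores of a single machine; the function name "expensive" and the inputs are purely illustrative:

from concurrent.futures import ProcessPoolExecutor
import math

def expensive(x):
    # Stand-in for any CPU-bound task.
    return sum(math.sqrt(i) for i in range(x))

if __name__ == "__main__":
    inputs = [5_000_000] * 8

    # Serial: one task at a time, on a single core.
    serial = [expensive(x) for x in inputs]

    # Parallel: the same tasks spread across the machine's cores.
    with ProcessPoolExecutor() as pool:
        parallel = list(pool.map(expensive, inputs))

    assert serial == parallel

Distributed computing extends the same idea across multiple machines, which is where tools like Dask come in.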

Figure: Parallel vs. distributed computing. A parallel computing system consists of multiple processors that use shared memory to communicate with each other, while a distributed computing system consists of multiple machines, each with its own CPU(s), connected by a communication network. (Source: Scaling Data Science in Python with Dask)

Another important concept in scalable computing is cloud computing. Providers like AWS, Azure, and GCP offer vast computing resources that can be accessed from anywhere, and cloud computing means leveraging those resources to perform big computations.

Scaling made simpler with Dask

Dask is a library for parallel and distributed computing in Python. It makes it easy to use all the resources on your local machine, set up distributed computing environments, and scale to the cloud. Its familiar API, flexible design, and synergy with the Python ecosystem make Dask the tool of choice for both individuals and institutions.
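
For example, Dask’s DataFrame API mirrors pandas almost line for line. Here is a minimal sketch; the file path and column names are hypothetical:

import dask.dataframe as dd

# Read many CSV files as one logical dataframe; the work is split
# into parallel tasks across cores (or across a cluster's machines).
df = dd.read_csv("data/*.csv")

# The same groupby/aggregation syntax as pandas. Nothing runs yet:
# Dask builds up a task graph lazily.
result = df.groupby("category")["amount"].mean()

# .compute() executes the graph in parallel and returns a pandas Series.
print(result.compute())

Dask Array offers the same kind of drop-in experience for NumPy workloads.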

Dask is used by industry leaders from Walmart and Capital One to government agencies like NASA and the UK Met Office, among many others, to scale their data science and machine learning pipelines. Dask is also used behind the scenes in many popular tools, including NVIDIA RAPIDS, Apache Airflow, and PyTorch!

Dask is a very powerful library, but scaling Dask computations to the cloud involves many DevOps challenges, like networking, security, and environment configuration. Coiled Cloud is a product built by the creators and maintainers of Dask that takes care of these challenges, so that data scientists can focus on their analysis. Coiled lets you scale to the cloud in just one click, while also providing essential features for team and cost management.

If you’re getting started with scalable computing, want to learn more about Dask and Coiled, or if you’re just curious about high-performance computing, check out our complete guide by clicking the link below!

Read Scaling Data Science in Python with Dask

Get started in three steps:

  1. Sign up with GitHub, Google, or email.
  2. Connect your AWS or GCP account.
  3. Start scaling:

$ pip install coiled                         # install the Coiled client
$ coiled setup                               # connect Coiled to your cloud account
$ ipython
>>> import coiled
>>> cluster = coiled.Cluster(n_workers=500)  # launch a 500-worker Dask cluster
>>> from dask.distributed import Client
>>> client = Client(cluster)                 # point your Dask computations at it