Parallel computing is important for analyses that run on large datasets, whether the computations take place on your local machine or in the cloud.
Real world datasets are growing fast and personal laptops sometimes aren’t powerful enough to handle the amount of data that needs to be processed. Data engineers, analysts, and scientists need to use parallel computing to split up computations so they can execute on large datasets in a reasonable amount of time.
This post explains what it means to scale up a computation with parallel computing on a single machine. It then talks about scaling out computations to multiple machines in a cluster with parallel computing (aka distributed computing).
Let’s look at some high level data sizes and brainstorm if they can be run on a personal laptop with 32GB of random access memory (RAM) and a 500GB solid state drive:
Merely executing a computation isn’t always enough, especially if the performance is painfully slow. Suppose you have a serial (aka sequential) computation that takes 8 hours to complete and you’re forced to run this as an overnight job. You might want to run this computation with a parallel algorithm so you can get results in, let’s say, 20 minutes. This would mean faster iteration on analyses and machine learning models and more value for your customers and/or business leaders.
Parallel computing isn’t a niche area of data analysis anymore. Lots of modern, real-world datasets are huge. You need to leverage parallel computations to run your analyses and get results without waiting too long.
The goal of this post is to give you a high level intuition of some parallel algorithms and how they’re executed on computer hardware. The description is intentionally high-level and glosses over the low-level details.
The complete picture is quite complex and you can feel free to dive down the rabbit hole after reading this post. For most readers, a high-level understanding will go a long way when you’re using distributed computation engines.
Most computers have multiple cores that can process computations in parallel when the program is properly structured. If the program is structured to run on a single core, then only a fraction of your machine will run computations while the other cores sit idle.
This diagram demonstrates a computation that’s been structured on a computer with one CPU and four cores. One of the cores does all the work and the other three cores just sit idle.
You aren’t getting the most out of your computer when cores are sitting idle and aren’t helping execute your computations.
Popular data processing technologies like Excel and pandas don’t allow you to easily run computations on multiple CPUs.
Excel allows you to check a “use all processors on this computer” button, but that only helps if the underlying algorithm has been structured to actually leverage all the cores.
pandas computations are also structured to run with one CPU by default. Advanced programmers have overcome this pandas limitation by creating instructions to process data in different pandas DataFrames by different cores in parallel.
If you’re an expert programmer with experience in distributed systems, you can write the code to parallelize Python yourself… but do you really want to spend your time writing all the low level code that you can access for free with a library?
A regular pandas analysis only uses one core to process the data.
Dask makes it possible to leverage all the cores on your machine when running computations. It can split the single pandas DataFrame into four pandas DataFrames so each core can run computations on a subset of the data at the same time.
See the Computers are not as parallel as we think blog post for more details on computer hardware and additional complexities that aren’t captured in these diagrams.
Let’s look at how leverage all the cores of a machine is especially useful in high performance computing environments.
Some companies and research institutions have supercomputers with a lot more processors, RAM, and hard drive space than a normal computer. A supercomputer may have 1,000 cores, or more.
It’s particularly wasteful to run a computation that only uses one core on a supercomputer: it will only use one core and leave the other 999 cores idle.
A parallel computation technology like Dask is often used in high performance computing environments because it makes it easy for organizations to take advantage of their powerful hardware by running computations on all available cores.
Dask isn’t limited to running on a single computer of course. It can also be run on a network of computers, called a cluster, to scale beyond the limits of a single machine.
You can also execute parallel algorithms on a network of computers that run computations in parallel.
The following diagram shows a cluster of three computers, each with one CPU, that are connected via a network.
Suppose you have a large tabular dataset. Dask DataFrames can split up the data into several pandas DataFrames, each sliver of data in different computers, to run parallel computations on a cluster.
Parallel computing on a single machine and parallel computing on a cluster both attack the big data scaling issue with a similar methodology. They’re both trying to split up datasets and process the data with more than one core, in parallel, so the computations run faster and can be scaled.
Single machine parallel computing leverages all the cores of a given machine. Distributed computing leverages all the cores of multiple computers.
Running computations on a cluster of computers seems more complicated than running computations on a single computer with multiple cores. Cluster computations add additional networking complexities like how the different computers in the cluster communicate with one another. So why are lots of organizations transitioning from single machine analyses to cluster computing, even when it’s more complex?
Different analyses have different computational requirements. Some computations need one core, others need 40, others need 10,000.
An organization with a collection of smaller computers can easily create clusters with different amounts of computational power for different analyses. Suppose an organization has 5,000 machines, each with 4 cores.
A 50 person data organization can easily share the 5,000 computers, spin up clusters that meet their needs, and run their analyses at the same time.
If the organization only has one 20,000 core supercomputer, then it’s harder for the organization to share the computation resources. They need to form a queue and wait for their coworker to finish running their computations before they can launch their own analysis.
Maintaining computers on premises adds a large maintenance burden for organizations. They need to install updates, maintain power, check security, etc. Many organizations have switched to the cloud, so the burdensome task of maintaining servers is offloaded to third parties.
Most cloud providers let organizations rent computing resources by the second, so they can scale their computations whenever necessary and avoid paying when they’re not running analyses.
Companies don’t want to invest significant resources in buying computers and maintaining them onsite. Some of them have supercomputers they purchased years ago that are now outdated and are sitting collecting dust. They’d rather offload the responsibility of buying new machines to a third party.
Cluster computing with resources rented in the cloud is how most organizations are moving forward for their big data, parallel workflows. It’s the most economical, maintenance-free, and flexible path.
Data professionals usually don’t care about where the computational resources are sourced or how the parallel computation algorithms work under the hood.
Data analysts want to run queries and generate reports. Data scientists want to train models. Data engineers want to build ETL pipelines. They’re interested in solving business problems and don’t want to worry about low-level details, like managing servers or how computations are executed in parallel.
Data professionals want to use parallel computing under the hood to run their computations in a performant manner, but it’s not something they want to actively think about. They’re sufficiently preoccupied with solving problems in their business domain.
Dask strikes the perfect balance for lots of data professionals. It offers high level abstractions so you can do your job, run algorithms in parallel, and get fast results. Dask also offers the low level abstractions for the power users that want to dive into the details and write their own custom parallel algorithms.
Running algorithms in parallel is the necessary path forward for most data professionals because modern datasets are huge and parallel processing is the only way to get results processed in a reasonable amount of time. Let’s dig deeper into the details of parallel computing and what makes it complex.
Parallel computing is hard because certain algorithms are hard to parallelize. This section demonstrates algorithms that are easy to parallelize and others that aren’t easy to run when data is distributed to multiple nodes in a cluster.
Most data professionals don’t care about how parallel computations actually execute under the hood. They just want their data to be processed in parallel, so they can get results fast.
We sympathize with these users that aren’t interested in “how the sausage gets made” and just want to use a library that takes care of all the parallel computation details. This is the right attitude most of the time. You should focus on your business domain, not low level computation details.
It’s good to get a high level conceptual grasp of parallel compute challenges, even if you’re relying on a library, so you can easily identify the types of computations that cause bottlenecks. A little parallel computing intuition goes a long when for debugging and performance optimizations.
The examples will use simple arrays. Let’s start by showing how to count all the elements in an array with a single threaded process and then how to run the same computation in parallel.
Suppose you would like to count the total number of elements in the following list: [1, 23, 2, 4, 19, 6, 5]
We can easily iterate over all the elements in the list and increment a variable to count all the elements. Here’s some code that’ll work:
i = 0
for element in elements:
i += 1
nums = [1, 23, 2, 4, 19, 6, 5]
count_elements(nums) # 7//]]>
Let’s look at parallelizing this computation by splitting up the data on multiple computers and running the counts in parallel.
Suppose the data in the array is split up and put on three different computers.
We suddenly have many CPUs (great!) but also have many different blocks of memory and disk, which can only be read by certain CPUs. Now there is a lot more need to coordinate who does what. This coordination is the first thing that makes distributed execution hard.
We need to restructure our algorithm now that the data is distributed on multiple machines. Here’s how the new computation can be structured:
Let’s visualize this computation.
One computer will run count_elements([1, 23]). At the same time another computer will run count_elements([2, 4, 19]). Also at the same time another computer will run count_elements([6, 5]).
After all those computations are finished, the counts will be collected on a single machine and summed.
If you’re counting all the elements in a really big list, parallel computing can help. Having three computers perform the computation at the same time will generally be a lot faster. But will parallel computing help in this example with a list that only contains 7 elements?
Parallel algorithms have additional overhead compared to single threaded process algorithms:
For small lists, the parallel algorithm that’s run on a cluster of computers will actually run slower. The extra steps will add more computation time than what’s saved from executing the algorithm in parallel.
The parallel algorithm also adds complexity of course. There’s no reason to add additional complexity if your dataset is small and the scaling benefits / performance gains from parallel computing don’t meaningfully improve your analysis. Don’t use parallel algorithms unless they’re needed.
Let’s look at how to compute the number of distinct elements in a list.
Suppose you have the following list with some duplicate elements: [a, b, c, a, a, d, b].
Here’s a function that’ll return the unique elements in the list.
s = set()
res = 0
for i in range(len(elements)):
if (elements[i] not in s):
res += 1
my_elements = ["a", "b", "c", "a", "a", "d", "b"]
count_distinct(my_elements) # 4//]]>
The count_distinct() algorithm works fine when run as a single threaded process.
Let’s think about how to scale this computation when it’s run in parallel on a cluster.
Let’s visualize how the data could be split on a three node cluster.
It’s not immediately obvious how we can parallelize the count_distinct() computation. When calculating the sum of total elements in the list, we were able to run count_elements() on each node in the cluster and then calculate the final result by summing all the intermediate results. We can’t sum the intermediate results to get the final result for a count distinct computation because that’d give us the wrong result (that’d return 6 instead of 4).
You could rework the algorithm to return a list of the distinct elements in each list, collect all of the distinct elements on a single node, and then run count_elements(), but that wouldn’t scale well for all datasets. You’re splitting the data on multiple nodes because it’s big. Perhaps the data is too big to fit on a single computer. If the array doesn’t contain any duplicate entries, then this algorithm would attempt to collect the entire dataset on a single node and would error out.
We’re not going to discuss the details of how all these computations actually work under the hood. For now, it’s just important for you to understand that algorithms need to be structured differently when they’re run in parallel and that parallelizing an algo isn’t always easy.
Dask provides users with easy access to prebuilt race cards so they can run parallel computations without worrying about any of the underlying details. The Dask engine figures out how to distribute the data on the different machines in the cluster and run the computations in parallel.
This lets the user run parallel computations without worrying about any of the low level details.
Most problems can be solved with the Dask DataFrame, Dask Array, and Dask machine learning abstractions. There are other Dask APIs that let you write parallel code if you’re so inclined.
One of the unique features of Dask compared to other parallel compute libraries is that it exposes low level APIs so you can write your own parallel algorithms.
The flexibility offered by Dask is powerful. Many Dask users have constructed complex, custom parallel compute engines with Dask with solutions customized for their business needs.
Writing parallel algorithms with low level Dask APIs isn’t easy. As we’ve seen, even relatively simple “count distinct” algorithms become much more complex when run on a cluster environment.
Some Dask users will never need to write their own parallel code. They can use the prebuilt racecars to solve all their problems. Other power users will want to use the prebuilt racecars in addition to building their own custom parallel compute code. This flexibility is one of the reasons Dask is loved by its users.