Science Thursday: Design Principles of Distributed Systems

Hugo Bowne-Anderson
October 26, 2020

Holden Karau joins Matt Rocklin & Hugo Bowne-Anderson to discuss the design of Dask, how it compares to PySpark, and why each project chose the trade-offs it did.

There are many different distributed systems solving what at first glance might seem like "the same problem." Here we'll talk about the skeletons in the respective closets of our different distributed tools, why we choose to let some ghosts stay, and which ghosts we're attempting to banish.

We're fortunate to have a great many different distributed data processing tools, including Hadoop MapReduce, Spark, Kafka Streams, Ray, Dask, and more. The design of each of these systems makes it better suited to different types of problems. Holden will bring her years of experience working on "big data" (but not representing the project or anyone else), as well as her more recent explorations in the Python-specific tooling beyond Spark (you can follow along on YouTube and on her slightly-slower-to-update blog). No live stream would be complete without the guest attempting to pitch you her latest book, and Holden promises to try to convince you that "Kubeflow for Machine Learning" is somehow a spooky holiday gift.

After attending, you’ll know:

  • The trade-offs Dask, PySpark, and MapReduce make & what they mean for you,
  • How to choose the right abstraction layer for your problem,
  • How and when to banish the JVM from your system,
  • Different approaches to dependency management, and 
  • The name of Holden's new puppy & the puppy's upcoming Halloween costume.
