Rodrigo and Felipe Aramburu, the brothers that lead BlazingSQL, recently joined us to discuss how they are empowering folks around the world to do GPU-accelerated data science in Python...with SQL!
“There’s an entire community of data analysts out there that need to become programmatically proficient for their long-term career trajectory. And for us, SQL is a great on-ramp for enabling these individuals to become dangerous and powerful relatively quickly.”
In this post, we’ll summarize the key takeaways from the stream.
You can find the code for the session at github.com/BlazingDB/blazingsql in the intro_notebooks folder.
Rodrigo (CEO) and Felipe (CTO) shared a little bit about themselves at the top of the stream.
GPU-accelerated SQL in Python? Sounds ridiculous, right?
Felipe responded, “It is very easy to write SQL. Writing Python is wonderful and great but it’s a little bit harder. BlazingSQL opens up GPU-accelerated data science to more people. They can start in SQL and then when they want to do something a little bit fancy, they can move up to Python.”
BlazingSQL is a SQL engine on DataFrames.
So how does BlazingSQL do GPU-accelerated SQL in Python? In Rodrigo’s own words,
“BlazingSQL is a SQL engine on DataFrames. DataFrames are a really easy way to handle and manipulate a table of data inside memory in Python. And BlazingSQL allows you to run a SQL query on that DataFrame.”
The DataFrame that BlazingSQL runs on is cuDF, the CUDA DataFrame library inside RAPIDS. It’s a GPU-accelerated DataFrame whose API looks very similar to pandas.
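Because cuDF mirrors the pandas API, the same DataFrame code often runs on either library by swapping the import. A minimal sketch of what that shared API looks like (written here with pandas so it runs on any machine; on a GPU you would swap the first line for `import cudf as pd` — the data is made up for illustration):

```python
# Runs on CPU with pandas; cuDF exposes a near-identical API, so on a
# GPU machine the same code works after swapping in `import cudf as pd`.
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "NYC", "SF"],
    "fare": [12.5, 8.0, 20.0],
})

# Familiar pandas-style operations: filter, group, aggregate.
mean_fares = df[df["fare"] > 5].groupby("city")["fare"].mean()
print(mean_fares.to_dict())  # {'NYC': 10.25, 'SF': 20.0}
```

This API symmetry is why the "start in SQL, move up to Python" path Felipe describes works: the Python you move up to looks like pandas, whether the data lives in host or GPU memory.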
Rodrigo hopped into a quick example.
“We’re just reading a CSV into memory. So this DataFrame now exists inside GPU memory. We can create a table off that that has a name on it and we can run whatever kind of SQL query on it we want. It’s ANSI SQL compliant. Runs entirely within Python. It’s a Python package with low-level C and C++ bindings, which is why it’s so performant.”
“Now we can extend that where maybe we’re not creating a SQL table off an in-memory GPU DataFrame. Maybe we’re creating it off a Dask cuDF, so a distributed or partitioned cuDF. Maybe we’re creating it off of a pandas DataFrame, or a Dask DataFrame on pandas. Maybe you want to go all the way down to files like Apache Parquet files, CSV files, what have you. And maybe those files don’t exist inside the server or computer that you’re running, maybe they exist in the data lake. So how can we create a SQL engine that runs directly off of files and in-memory DataFrame representations, with really high-performance query execution, in Python?”
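The workflow Rodrigo describes — register in-memory data under a table name, then query it with SQL from inside Python — can be sketched on any machine using the standard library’s `sqlite3` as a CPU stand-in (BlazingSQL itself needs a GPU to run; the table name and data below are invented for illustration):

```python
# CPU stand-in for the BlazingSQL workflow: in-memory data becomes a
# named SQL table that plain Python can query. BlazingSQL applies the
# same pattern to GPU DataFrames, with distributed execution via Dask.
import sqlite3

rows = [("NYC", 12.5), ("NYC", 8.0), ("SF", 20.0)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE taxi (city TEXT, fare REAL)")
con.executemany("INSERT INTO taxi VALUES (?, ?)", rows)

result = con.execute(
    "SELECT city, AVG(fare) FROM taxi GROUP BY city ORDER BY city"
).fetchall()
print(result)  # [('NYC', 10.25), ('SF', 20.0)]
```

The difference, of course, is where the table lives: here it is copied into SQLite’s own storage, whereas BlazingSQL queries the cuDF (or Dask cuDF) DataFrame in GPU memory directly, and returns the result as a DataFrame you can keep working with in Python.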
Matt Rocklin hopped in, “One of the common requests we get for Dask is, ‘Hey, do you support SQL? I love that [with Dask] I can read in some data from my S3 bucket, that I can do some custom Python manipulation, but then I want to hand it off to a SQL engine and send that to Tableau or something.' And my answer has always been, ‘No, there is no good SQL system in Python.’ But now there is—if you have GPUs.”
That’s where our engine starts making a lot of sense.
Felipe shared two exciting customer use cases:
The second bullet is a trend the Coiled team is seeing too, most recently in our live stream and blog post with Alex Egg from Grubhub.
Even if I don’t know pandas, all of a sudden Python and [data visualization] is opening up to me because I know SQL.
Rodrigo started, “One of the problems we’re seeing for people wanting to use GPUs is actually gaining access to a GPU. A lot of people might not have one inside their laptop and they might not know how to spin one up in the cloud. So we created a simple free environment that anyone can use at app.blazingsql.com.” You can use it to follow along. They start in the data_visualization.ipynb notebook in the intro_notebooks folder.
You pass Dask to BlazingSQL so BlazingSQL knows where the workers are.
Coiled, among other things, provides hosted and scalable Dask clusters. Here’s how Rodrigo leverages Dask:
Rodrigo: “That’s exactly right. You pass Dask to BlazingSQL so BlazingSQL knows where the workers are.”
Matt: “You will never know the pain of creating software that interacts with every single possible cluster management system.” Note: Matt is a core developer of Dask.
Rodrigo: “Oh we do! That’s why we didn’t do it! Before BlazingSQL, we were building an old school DBMS called BlazingDB. One time I asked Felipe, ‘For this to be effective, I need to be able to deploy a BlazingDB cluster on my own relatively easily.’ And his response was, ‘You will never be able to deploy it. This is only for the hardcore of the hardcore.’ So yeah, we appreciate the work y’all do at Dask.”
BlazingSQL’s vision, in Rodrigo’s words:
“The PyData ecosystem is empowering for data scientists, but there’s an entire community of data analysts out there that are becoming programmatically proficient or need to learn to be for their long term career trajectory. And for us, SQL is a great on-ramp for enabling that for these individuals that have these capabilities to become dangerous and powerful relatively quickly. As opposed to having to start figuring out a new paradigm of data manipulation, which can be pretty vexing if you’ve never seen it before.”
We agree. We’re grateful for Rodrigo’s and Felipe’s time.
If you need to do data science at scale, you too can get started on a Coiled cluster immediately. Coiled also handles security, conda/docker environments, and team management, so you can get back to doing data science.