BlazingSQL review: Fast ETL for GPU-based data science

BlazingSQL builds on RAPIDS to distribute SQL query execution across GPU clusters, delivering the ETL for an all-GPU data science workflow.

At a Glance
  • BlazingSQL V0.17

BlazingSQL is a GPU-accelerated SQL engine built on top of the RAPIDS ecosystem. BlazingSQL allows standard SQL queries to be distributed across GPU clusters, and the results to be fed directly into GPU-accelerated visualization and machine learning libraries. Basically, BlazingSQL provides the ETL portion of an all-GPU data science workflow.

RAPIDS is a suite of open source software libraries and APIs, incubated by Nvidia, that uses CUDA and is based on the Apache Arrow columnar memory format. cuDF, part of RAPIDS, is a Pandas-like DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data on GPUs.
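
To give a flavor of cuDF's Pandas-like API, here is a minimal sketch with made-up data; the DataFrame lives in GPU memory and the aggregation runs on the GPU:

import cudf

# build a small DataFrame in GPU memory (hypothetical data)
gdf = cudf.DataFrame({'key': ['a', 'b', 'a', 'b'], 'value': [1.0, 2.0, 3.0, 4.0]})
# groupby/mean executes on the GPU, just like the Pandas equivalent
print(gdf.groupby('key')['value'].mean())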

For distributed SQL query execution, BlazingSQL draws on Dask, which is an open source tool that can scale Python packages to multiple machines. Dask can distribute data and computation over multiple GPUs, either in the same system or in a multi-node cluster. Dask integrates with RAPIDS cuDF, XGBoost, and RAPIDS cuML for GPU-accelerated data analytics and machine learning.

BlazingSQL is a SQL interface for cuDF, with various features to support large-scale data science workflows and enterprise datasets, including support for the dask-cudf library maintained by the RAPIDS project. BlazingSQL allows you to query data stored externally (such as in Amazon S3, Google Cloud Storage, or HDFS) using simple SQL; the results of your SQL queries are GPU DataFrames (GDFs), which are immediately accessible to any RAPIDS library for data science workloads.

The BlazingSQL code is an open source project released under the Apache 2.0 License. The BlazingSQL Notebooks site is a service using BlazingSQL, RAPIDS, and JupyterLab, built on AWS. It currently uses g4dn.xlarge instances and Nvidia T4 GPUs. There are plans to upgrade some of the larger BlazingSQL Notebooks cluster sizes to A100 GPUs in the future.

In a nutshell, BlazingSQL lets you ETL raw data directly into GPU memory as GPU DataFrames. Once you have GPU DataFrames in GPU memory, you can use RAPIDS cuML for machine learning, or convert the DataFrames to DLPack or NVTabular for in-GPU deep learning with PyTorch or TensorFlow.
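
For example, here is a hedged sketch of handing a query result to PyTorch without leaving the GPU. It assumes a BlazingContext named bc with a registered table, as described in the API section below, and that the selected columns share a single numeric dtype (a requirement for DLPack export):

import torch
from torch.utils.dlpack import from_dlpack

# run a query; the result is a cuDF DataFrame in GPU memory
gdf = bc.sql('SELECT col_a, col_b FROM table_name')
# export a DLPack capsule and wrap it as a PyTorch tensor without copying
tensor = from_dlpack(gdf.to_dlpack())
print(tensor.shape, tensor.device)   # e.g. torch.Size([n, 2]) cuda:0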

BlazingSQL architecture

As we can see in the figures below, BlazingSQL integrates SQL into the RAPIDS ecosystem. The first diagram shows the BlazingSQL stack, and the second diagram shows how BlazingSQL fits with other components of the RAPIDS ecosystem.

Looking at the first diagram, BlazingSQL connects to Apache Calcite via JPype and uses it as a SQL parser to create a relational algebra plan from a SQL string. The Relational Algebra Engine (RAL) handles all the smarts of turning that plan into a distributed homogeneous execution graph, so that every worker knows what it needs to process. The RAL also manages query execution at runtime, for example estimating memory consumption (across GPU memory, system memory, and disk) in order to handle queries that require out-of-core processing.

The best way to think about it is that RAL is the brains of the engine. Everything above it is a thin client, and what’s below it are compute kernels and underlying libraries.

[Diagram: The BlazingSQL stack. Image: BlazingSQL]

The Relational Algebra Engine (RAL) is the brains of BlazingSQL. It handles all the smarts of turning the relational algebra plan from Calcite into a distributed homogeneous execution graph, letting every worker know what it needs to process.

[Diagram: How BlazingSQL fits within the RAPIDS ecosystem. Image: BlazingSQL]

BlazingSQL turns SQL queries against tabular data into GPU DataFrames. From there you use the components of RAPIDS to prepare the data, perform machine learning, and create graph analytics.

BlazingSQL API

The Python class BlazingContext implements BlazingSQL’s API. The simplest use is for a single GPU:

from blazingsql import BlazingContext
bc = BlazingContext()   # uses the single local GPU

For multiple GPUs in a single node, you need to use a LocalCUDACluster:

from blazingsql import BlazingContext
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
cluster = LocalCUDACluster()   # one Dask worker per local GPU
client = Client(cluster)
bc = BlazingContext(dask_client=client, network_interface='lo')

For multiple nodes, you need to refer to a running Dask scheduler by its network address:

from blazingsql import BlazingContext
from dask.distributed import Client
client = Client('123.123.123.123:8786')   # address of the running Dask scheduler
bc = BlazingContext(dask_client=client, network_interface='eth0')

The network interface for a cluster will vary depending on the environment. On AWS, it’s likely to be ens5. On the BlazingSQL cloud service, the correct IP address and network interface of the Dask scheduler will be filled in for you in the distributed welcome notebook.

Once you have a BlazingContext instance, you can call its methods to create, manage, and query tables:

# create a table from all the files in a directory
bc.create_table('table_name', '/home/user/table_dir/*')
# define a query
query = 'select * from table_name limit 10'
# explain how the query will be executed
print(bc.explain(query))
# run the query; the result is a cuDF (or dask_cudf) DataFrame
result = bc.sql(query)
# display results
print(result.head())
# drop the table
bc.drop_table('table_name')

Note that the distributed query processing is hidden from you. The only time you need to think about it is when you create the BlazingContext.
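
As an example of querying external storage, here is a hedged sketch of registering an S3 bucket and creating a table from it; the alias, bucket, and file names are hypothetical, and private buckets take credential arguments as well:

# register the bucket under an alias that becomes the root of an s3:// path
bc.s3('my_alias', bucket_name='my-public-bucket')
bc.create_table('trips', 's3://my_alias/data/trips_*.parquet')
print(bc.sql('SELECT COUNT(*) AS n FROM trips'))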

BlazingSQL SQL

BlazingSQL uses Apache Calcite, an industry-standard SQL parser and validator, as its SQL front end. BlazingSQL itself implements a subset of what Calcite can handle. For example, it includes four widths of INT and FLOAT, VARCHAR, two widths of DATE, and TIMESTAMP, but no BOOLEAN, DECIMAL, CHAR, BINARY, or GEOMETRY types. It supports SELECT with most of its subclauses, but no INSERT, PIVOT, or any other DML or DDL.

You can, however, manipulate data in a GPU DataFrame using the cuDF Python APIs, and do even more if you convert it to a Pandas DataFrame. Unfortunately, a Pandas DataFrame resides in CPU RAM, not in GPU memory.
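
A minimal sketch of that round trip, with a hypothetical column name; to_pandas() copies the data from GPU memory into CPU RAM, and from_pandas() copies it back:

import cudf

gdf = bc.sql('SELECT * FROM table_name LIMIT 1000')   # cuDF DataFrame on the GPU
pdf = gdf.to_pandas()                       # now an ordinary Pandas DataFrame in CPU RAM
pdf['col_a'] = pdf['col_a'].interpolate()   # use a Pandas-only method
gdf = cudf.from_pandas(pdf)                 # copy the result back to GPU memory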

Most of the common SQL functions for the types listed are supported, or at least mentioned in the documentation. There are CAST functions for more data types than are listed as supported; I didn’t test to see whether they work or how they fail if they don’t work.
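
For instance, a typical query within the supported subset combines SELECT subclauses with a CAST; the taxi table and column names here are hypothetical, though the Notebooks section below uses a similar dataset:

query = '''
    SELECT passenger_count,
           AVG(CAST(fare_amount AS DOUBLE)) AS avg_fare
    FROM taxi
    WHERE trip_distance > 0
    GROUP BY passenger_count
'''
print(bc.sql(query))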

Using the BlazingSQL Notebooks service

BlazingSQL Notebooks offers guides to all of its functionality, from creating clusters to writing SQL and Python. The BlazingSQL Notebooks main user interface has tabs for clusters, environments, credits, and documentation. Here I have not yet created any private GPU clusters, but have some credits so that I can do so.

[Screenshot: The BlazingSQL Notebooks main user interface. Image: IDG]

BlazingSQL Notebooks run on cloud instances with GPUs. A single GPU is free. A GPU cluster costs 1 credit per GPU per hour, and a credit currently costs $0.75, so the four-GPU cluster created below runs $3 per hour.

Rapids Stable will be my default environment.

[Screenshot: Choosing a default environment. Image: IDG]

There are two possible default environments for BlazingSQL Notebooks. Rapids Stable could be up to six weeks old, but has been tested. Rapids Nightly, as you might expect, is the latest version of RAPIDS.

Here I’m creating a “medium” four-GPU cluster. Each node is a g4dn.xlarge instance with a single Nvidia T4 GPU. Each T4 has 16 GB of GPU memory and delivers up to 65 TFLOPS of FP16 performance, so this cluster should be capable of up to 260 TFLOPS on FP16 data, and up to roughly twice that, measured in TOPS, on INT8 data.

[Screenshot: Creating a private GPU cluster. Image: IDG]

When you create a private GPU cluster you can choose a name, the size of the cluster, the auto-suspend time, the region, and the environment.

Once my cluster is created and running, I can use the rocket links to launch JupyterLab, and I can also view the Dask dashboard for the cluster.

[Screenshot: A created cluster, ready to launch. Image: IDG]

Once you’ve created a cluster, you can start and stop it at will. It takes several minutes to create a cluster, and several minutes to start one.

BlazingSQL examples (notebooks)

While I went through all of the introductory notebooks, I’ll only show you a few selected screenshots. The code is either Python or SQL; the SQL tends to be in Python strings, and query results are cuDF DataFrames. The next screenshot has the most interesting graphics of the bunch, and takes advantage of the Datashader visualization tool, which supports GPUs and cuDF DataFrames.

I have already loaded some NYC taxi data into a table named taxi from a CSV file. bc is a BlazingContext, pretty much as I described in the API discussion above.
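
That setup looks roughly like the following sketch; the file path is hypothetical and stands in for the notebook's actual data location:

from blazingsql import BlazingContext

bc = BlazingContext()
# create the taxi table from a CSV file (hypothetical path)
bc.create_table('taxi', '/path/to/yellow_tripdata.csv')
print(bc.sql('SELECT COUNT(*) FROM taxi'))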

[Screenshot: Datashader heat map of taxi drop-offs. Image: IDG]

The Datashader package does in-GPU data visualization. Here we see a heat map of taxi drop-offs from rides starting in Manhattan.
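
A hedged sketch of how such a heat map is produced with Datashader, assuming a cuDF DataFrame gdf with projected dropoff_x and dropoff_y columns (the notebook's actual column names may differ):

import datashader as ds
import datashader.transfer_functions as tf

canvas = ds.Canvas(plot_width=800, plot_height=600)
# aggregate point counts per pixel; runs on the GPU for cuDF input
agg = canvas.points(gdf, 'dropoff_x', 'dropoff_y')
# map the counts onto a color ramp to produce an image
img = tf.shade(agg, cmap=['darkblue', 'white'])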

Here I’ve shown a later section of the same notebook, in which we query the taxi data twice: x contains the features that should affect the fare, and y is the target, the fare amount we want to predict. We’re using a simple least squares linear regression model from cuML, which is essentially the same as Scikit-learn’s LinearRegression.

[Screenshot: SQL queries and cuML machine learning in the distributed Welcome notebook. Image: IDG]

This example is at the end of the distributed welcome notebook. We can see the SQL queries in step 17, and the in-GPU machine learning in steps 18 and 19. Since cuML doesn’t implement all of Scikit-learn, we need to convert the data from in-GPU cuDF format to Pandas format to use the sklearn r2_score function in step 20.
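
A hedged reconstruction of those steps on a single GPU (with a distributed cluster, bc.sql returns dask_cudf DataFrames that you would compute() first); the column names are illustrative:

from cuml.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = bc.sql('SELECT passenger_count, trip_distance FROM taxi')   # features
y = bc.sql('SELECT fare_amount FROM taxi')['fare_amount']       # target

lr = LinearRegression()
lr.fit(X, y)                  # trains entirely in GPU memory
predictions = lr.predict(X)

# r2_score is CPU-side Scikit-learn, so convert the cuDF results to Pandas first
print(r2_score(y.to_pandas(), predictions.to_pandas()))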

Beyond the Welcome notebook, you can run the other notebooks shown in the screenshot below. A few of them can benefit from running a private cluster, but most will run fine on a free single-GPU instance.

[Screenshot: The introductory BlazingSQL notebooks. Image: IDG]

There are five introductory BlazingSQL Notebooks. We’ve seen much of the distributed version of the Welcome notebook. The other notebooks zero in on the cuDF DataFrame, data visualization, cuML machine learning, and finally a live version of the examples from the documentation.

To summarize, BlazingSQL is a GPU-accelerated SQL engine built on top of the RAPIDS ecosystem. The BlazingSQL code is an open source project released under the Apache 2.0 License. The BlazingSQL Notebooks site is a service using BlazingSQL, RAPIDS, and JupyterLab, built on AWS.

Using Dask and a couple of lines of code, BlazingSQL and BlazingSQL Notebooks support multiple GPUs on a single node and clusters of multiple nodes. Once you’ve selected the data you want with BlazingSQL and gotten it into cuDF GPU DataFrames, you can use cuDF APIs to manipulate the data. If you encounter missing methods in cuDF, you can convert the data to a Pandas DataFrame and process it using Pandas methods in normal RAM.

You can perform some data visualization and cuML machine learning entirely in the GPU on cuDF DataFrames. You can also convert the DataFrames to DLPack or NVTabular for in-GPU deep learning.

If you are comfortable writing SQL queries and writing Python, BlazingSQL and BlazingSQL Notebooks will help you with your data science, especially the ETL phase. If you’re not, you may want to consider an AutoML solution with ETL support such as DataRobot, or a drag-and-drop machine learning system such as the Azure Machine Learning Studio.

Cost: Free open source. As a service, $0.75/credit, 1 credit = 1 GPU-hour.

Platform: Service runs on AWS. Open source requires Anaconda or Miniconda; Ubuntu 16.04/18.04 or CentOS 7; Nvidia Pascal+; CUDA 10.1.2 or 10.2; Python 3.7 or 3.8. Alternatively, you can use Docker with Nvidia support.

At a Glance
  • BlazingSQL is a GPU-accelerated SQL engine built on top of the RAPIDS ecosystem, allowing standard SQL queries to be distributed across GPU clusters. If you are comfortable writing SQL queries and writing Python, BlazingSQL and BlazingSQL Notebooks will help you with your data science, especially the ETL phase.

    Pros

    • Uses standard SQL queries with GPU acceleration
    • Supports GPU clusters with small changes to the Python code
    • Integrates easily with GPU-based visualization and machine learning
    • Can convert in-GPU structures to standard in-RAM structures for further analysis

    Cons

    • cuDF lacks some features of Pandas, but you can convert between them
    • cuML lacks some features of Scikit-learn, but you can convert the data to Pandas DataFrames
    • Neither PyTorch nor TensorFlow supports cuDF DataFrames, but conversion is possible

Copyright © 2021 IDG Communications, Inc.