Dremio Cloud review: A fast and flexible data lakehouse on AWS

Dremio Cloud leaps big data in a single bound with a fast SQL engine and optimizations that can accelerate queries dramatically. Plus it lets you use other engines on the same data.

Page 2 of 2

Dremio’s recommended best practice is to layer your datasets. The first layer is the physical datasets. The second layer adds one virtual dataset per physical dataset, applying minor scrubbing, security redactions, and access limits. In the third layer, users create virtual datasets that perform joins and other expensive operations; this is where the intensive work on data is performed. These users then create reflections (raw, aggregation, or both) from their virtual datasets. The fourth layer is typically lightweight virtual datasets for dashboards, reports, and visualization tools.
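To make the layering concrete, here is a minimal Python sketch that emits one example view definition for each of the three virtual layers. Everything in it is an assumption made for illustration: the space names (`source`, `staging`, `business`, `reporting`), the tables, and the columns are hypothetical, and the statements are generic `CREATE VIEW` SQL rather than anything taken from Dremio’s documentation.

```python
def layer_views() -> dict[str, str]:
    """Return one illustrative view definition per virtual layer (layers 2-4).

    All names are hypothetical. Layer 1 (the physical dataset) is assumed to
    already exist as source.employees.
    """
    return {
        # Layer 2: one view per physical dataset, with light scrubbing and
        # redaction. A sensitive column such as pay is simply not selected.
        "staging.employees": (
            "CREATE VIEW staging.employees AS "
            "SELECT employee_id, department_id, TRIM(full_name) AS full_name "
            "FROM source.employees"
        ),
        # Layer 3: joins and aggregations, the expensive work. This is the
        # layer you would consider building reflections on.
        # staging.departments is a second layer-2 view, assumed to be built
        # the same way as staging.employees.
        "business.headcount_by_department": (
            "CREATE VIEW business.headcount_by_department AS "
            "SELECT d.department_name, COUNT(*) AS headcount "
            "FROM staging.employees e "
            "JOIN staging.departments d ON e.department_id = d.department_id "
            "GROUP BY d.department_name"
        ),
        # Layer 4: a lightweight view shaped for a dashboard or report.
        "reporting.top_departments": (
            "CREATE VIEW reporting.top_departments AS "
            "SELECT department_name, headcount "
            "FROM business.headcount_by_department "
            "WHERE headcount >= 10"
        ),
    }


if __name__ == "__main__":
    for name, sql in layer_views().items():
        print(f"-- {name}\n{sql};\n")
```

The point of the structure is that each layer reads only from the layer directly beneath it, so redactions applied in the second layer carry through to everything built on top.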

I can verify from experiments that reflections can make a huge difference in the speed of complicated queries against big datasets. Reflections reduced the query time on one example with aggregations from 29 seconds to less than one second.

dremio cloud 04 IDG

A Dremio Cloud account includes data samples. Here I’ve hovered over the line listing the Dremio University folder to bring up the Format Folder icon. The rest of the main Samples folder has already been formatted as physical datasets (PDS), as you can tell from the color of the icons.

dremio cloud 05 IDG

The Samples/Dremio University folder has not yet been formatted, as you can see from the color of the icons. I’ve hovered over the line containing employees.parquet to bring up the Format File icon on the right.

dremio cloud 06 IDG

The datasets view defaults to a select * preview query that displays a limited number of rows. You can use the facilities in this view to help you construct more complicated queries. The full data catalog is hiding in the left-hand column.

dremio cloud 07 IDG

The SQL Runner view gives you a data catalog by default as well as the facilities to help you construct joins, aggregations, and filters. By default it doesn’t try to preview any datasets.

dremio cloud 08 IDG

The dataset graph view shows the provenance of your datasets. We can see that the data source is a directory in Amazon S3 that has been formatted as a physical dataset (PDS). The PDS has in turn been used to create a virtual dataset (VDS).

dremio cloud 09 IDG

The dataset reflections tab shows the existing reflections for a VDS, or tries to generate them if there are none yet. These dimensions and measures are the defaults for this dataset. Raw reflections are more useful for datasets based on aggregation queries.

Connecting client applications to Dremio

Dremio currently connects to over 10 client applications and languages, listed below. In general, you can authenticate using a Dremio personal access token or using Microsoft Azure Active Directory as an enterprise identity provider. Connecting to Dremio from Power BI Service used to require a gateway, but as of June 2022 you can connect the two directly.

Dremio has both ODBC and JDBC drivers. Python applications, including Jupyter Notebooks, can connect using Arrow Flight from the PyArrow library plus Pandas to handle the data frames.
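As a rough illustration of the Python route, the sketch below runs a SQL query over Arrow Flight with PyArrow and returns the result as a Pandas data frame, authenticating with a personal access token. Treat the details as assumptions rather than Dremio’s documented procedure: the endpoint `grpc+tls://data.dremio.cloud:443`, the bearer-token header format, and the helper names `bearer_headers` and `query_to_dataframe` are mine, so check them against your own account settings before relying on them.

```python
def bearer_headers(token: str) -> list[tuple[bytes, bytes]]:
    """Build the call headers for token authentication.

    Assumption: the server accepts a personal access token as a standard
    'authorization: Bearer <token>' header.
    """
    return [(b"authorization", f"Bearer {token}".encode("utf-8"))]


def query_to_dataframe(
    sql: str,
    token: str,
    endpoint: str = "grpc+tls://data.dremio.cloud:443",  # assumed endpoint
):
    """Run a SQL query over Arrow Flight and return a pandas DataFrame."""
    # Imported here so the module loads even where pyarrow is not installed.
    from pyarrow import flight

    client = flight.FlightClient(endpoint)
    options = flight.FlightCallOptions(headers=bearer_headers(token))

    # Ask the server to plan the query, then fetch the first result stream.
    descriptor = flight.FlightDescriptor.for_command(sql)
    info = client.get_flight_info(descriptor, options)
    reader = client.do_get(info.endpoints[0].ticket, options)

    # read_all() collects the stream into an Arrow table; to_pandas() needs
    # pandas to be installed.
    return reader.read_all().to_pandas()


if __name__ == "__main__":
    import os

    # DREMIO_TOKEN is a hypothetical environment variable holding your token;
    # the table name below is a placeholder.
    df = query_to_dataframe(
        'SELECT * FROM "my_space"."my_view" LIMIT 10',
        os.environ["DREMIO_TOKEN"],
    )
    print(df.head())
```

This sketch reads only the first endpoint returned by the server, which is enough for a small result; a more careful client would loop over every entry in `info.endpoints`.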

dremio cloud 10 IDG

Dremio Cloud currently exposes 10 connectors to client applications, mostly BI and database client applications and Python code.

Overall, Dremio is very good as a data lakehouse. While I disagree with the company’s marketing that denigrates its competitors, Dremio does have a product that can query large datasets with sub-second responses once you’ve created reflections, at least as long as the engine you’re using is active and hasn’t gone to sleep.

The obvious direct competitor to Dremio is the Databricks Lakehouse Platform. I think both platforms are very good, and I would encourage you to try the free versions of both of them. Using multiple engines on the same data is one of Dremio’s selling points, after all, so why not take advantage of it?

Cost: Dremio Cloud Standard edition: free forever. Dremio Cloud Enterprise edition: $0.39/DCU (Dremio consumption unit). Cloud infrastructure cost is not included. Dremio Enterprise Software runs on premises: contact sales for pricing.

Platform: Cloud server runs on AWS; server software runs on AWS, Azure, and Linux; requires Java 8. Supported client browsers include Chrome, Safari, Firefox, and Edge.

At a Glance
  • Dremio Cloud brings a fast SQL engine and efficient columnar storage to the data lake. It also has the ability to accelerate queries with automated “reflections,” and the ability to share data with other analytical engines. It supports BI and machine learning through drivers and connectors to third-party software.

    Pros

    • Fast SQL engine for files in a data lake
    • Able to accelerate queries with automated “reflections”
    • Able to share data with other engines
    • Has connections to BI and database software as well as Python
    • Standard edition is free forever, but doesn’t include cloud infrastructure costs

    Cons

    • Has no machine learning or deep learning capabilities of its own; use Python for that

Copyright © 2022 IDG Communications, Inc.
