Dremio Cloud review: A fast and flexible data lakehouse on AWS
Dremio Cloud leaps big data in a single bound with a fast SQL engine and optimizations that can accelerate queries dramatically. Plus it lets you use other engines on the same data.
- Dremio Cloud overview
- Dremio Arctic overview
- Dremio data file formats
- Dremio query acceleration
- Dremio Engines
- Getting started with Dremio Cloud
- Connecting client applications to Dremio
Dremio’s recommended best practice is to layer your datasets. The first layer is the physical datasets themselves. The second layer adds one virtual dataset per physical dataset, applying minor scrubbing, security redactions, and access limits. In the third layer, users create virtual datasets that perform joins and other expensive operations; this is where the intensive work on data is performed, and these users then create reflections (raw, aggregation, or both) from their virtual datasets. The fourth layer is typically lightweight virtual datasets for dashboards, reports, and visualization tools.
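To make the layering concrete, here is a minimal sketch that mimics the four layers with ordinary SQL views. It uses Python's built-in sqlite3 module purely as a stand-in, not Dremio itself, and every table, view, and column name is hypothetical. Reflections are left out because SQLite has no equivalent.

```python
import sqlite3

# SQLite stands in for the lakehouse here; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Layer 1: stand-ins for physical datasets
CREATE TABLE employees_pds (id INTEGER, name TEXT, ssn TEXT, dept_id INTEGER, salary REAL);
CREATE TABLE departments_pds (dept_id INTEGER, dept_name TEXT);
INSERT INTO employees_pds VALUES
  (1, ' Ada ', '111-11-1111', 10, 120.0),
  (2, 'Grace', '222-22-2222', 10, 130.0),
  (3, 'Linus', '333-33-3333', 20, 110.0);
INSERT INTO departments_pds VALUES (10, 'Engineering'), (20, 'Research');

-- Layer 2: one view per physical dataset, with light scrubbing and redaction
CREATE VIEW employees_clean AS
  SELECT id, TRIM(name) AS name, dept_id, salary FROM employees_pds;  -- ssn dropped
CREATE VIEW departments_clean AS
  SELECT dept_id, dept_name FROM departments_pds;

-- Layer 3: joins and other expensive work
CREATE VIEW employees_by_dept AS
  SELECT d.dept_name, e.name, e.salary
  FROM employees_clean AS e
  JOIN departments_clean AS d ON e.dept_id = d.dept_id;

-- Layer 4: a lightweight view shaped for a dashboard
CREATE VIEW dept_salary_dashboard AS
  SELECT dept_name, COUNT(*) AS headcount, SUM(salary) AS total_salary
  FROM employees_by_dept
  GROUP BY dept_name;
""")

# Columns visible in the second layer (the redacted ssn column is absent).
clean_columns = [d[0] for d in conn.execute("SELECT * FROM employees_clean").description]

# What a dashboard would read from the fourth layer.
dashboard_rows = conn.execute(
    "SELECT * FROM dept_salary_dashboard ORDER BY dept_name"
).fetchall()
print(clean_columns)
print(dashboard_rows)
```

The point of the structure is that redaction lives in exactly one place (the second layer), so everything built on top inherits it.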
I can verify from experiments that reflections can make a huge difference in the speed of complicated queries against big datasets. Reflections reduced the query time on one example with aggregations from 29 seconds to less than one second.
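The idea behind an aggregation reflection can be illustrated by analogy: precompute a grouped summary once, then answer aggregate queries from the much smaller summary instead of rescanning the raw rows. The sketch below does this by hand in SQLite with made-up data and hypothetical table names. It is not Dremio's implementation; in Dremio the substitution is handled by the query engine rather than written out by the user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw data standing in for a large dataset.
conn.execute("CREATE TABLE trips (pickup_date TEXT, passengers INTEGER, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("2022-06-01", 1, 10.0),
        ("2022-06-01", 2, 14.5),
        ("2022-06-02", 1, 8.0),
        ("2022-06-02", 3, 20.0),
    ],
)

AGG_SQL = """
SELECT pickup_date, COUNT(*) AS trip_count, SUM(fare) AS total_fare
FROM trips
GROUP BY pickup_date
ORDER BY pickup_date
"""

# Precompute the aggregation once and store it (the reflection analogy).
conn.execute(f"CREATE TABLE trips_agg AS {AGG_SQL}")

# The same question answered two ways: from raw rows and from the summary.
direct = conn.execute(AGG_SQL).fetchall()
from_summary = conn.execute(
    "SELECT pickup_date, trip_count, total_fare FROM trips_agg ORDER BY pickup_date"
).fetchall()

print(direct == from_summary)
```

On four rows there is nothing to gain, of course; the benefit appears when the raw table is large and the summary is small.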
A Dremio Cloud account includes data samples. Here I’ve hovered over the line listing the Dremio University folder to bring up the Format Folder icon. The rest of the main Samples folder has already been formatted as physical datasets (PDS), as you can tell from the color of the icons.
The Samples/Dremio University folder has not yet been formatted, as you can see from the color of the icons. I’ve hovered over the line containing employees.parquet to bring up the Format File icon on the right.
The datasets view defaults to a select * preview query that displays a limited number of rows. You can use the facilities in this view to help you construct more complicated queries. The full data catalog is hiding in the left-hand column.
The SQL Runner view gives you a data catalog by default as well as the facilities to help you construct joins, aggregations, and filters. By default it doesn’t try to preview any datasets.
The dataset graph view shows the provenance of your datasets. We can see that the data source is a directory in Amazon S3 that has been formatted as a physical dataset. The PDS has, in turn, been converted into a virtual dataset.
The dataset reflections tab shows the existing reflections for a virtual dataset (VDS), or tries to generate them if none exist yet. These dimensions and measures are the defaults for this dataset. Raw reflections are more useful for datasets based on aggregation queries.
Connecting client applications to Dremio
Dremio currently connects to over 10 client applications and languages, listed below. In general, you can authenticate using a Dremio personal access token or using Microsoft Azure Active Directory as an enterprise identity provider. Connecting to Dremio from Power BI Service used to require a gateway, but as of June 2022 you can connect them directly, without a gateway.
Dremio has both ODBC and JDBC drivers. Python applications, including Jupyter Notebooks, can connect using Arrow Flight from the PyArrow library plus Pandas to handle the data frames.
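A Python connection over Arrow Flight might look roughly like the sketch below. It assumes the pyarrow and pandas packages are installed, that a personal access token is stored in an environment variable I have named DREMIO_PAT, and that the Flight endpoint is grpc+tls://data.dremio.cloud:443. The endpoint, the variable name, and the helper function names are my assumptions rather than details from Dremio's documentation as quoted here, so check them against your own account.

```python
import os

# Assumed Arrow Flight endpoint for Dremio Cloud; verify it for your organization.
DREMIO_FLIGHT_URL = "grpc+tls://data.dremio.cloud:443"


def bearer_headers(token: str) -> list[tuple[bytes, bytes]]:
    """Build the authorization header that carries a personal access token."""
    return [(b"authorization", f"Bearer {token}".encode("utf-8"))]


def run_query(sql: str, token: str, url: str = DREMIO_FLIGHT_URL):
    """Run a SQL query over Arrow Flight and return a pandas DataFrame."""
    # Imported here so the helper above can be used without pyarrow installed.
    from pyarrow import flight

    client = flight.FlightClient(url)
    options = flight.FlightCallOptions(headers=bearer_headers(token))

    # Ask the server how to fetch the results of this SQL command.
    descriptor = flight.FlightDescriptor.for_command(sql)
    info = client.get_flight_info(descriptor, options)

    # Stream the first endpoint's results and convert them to a data frame.
    reader = client.do_get(info.endpoints[0].ticket, options)
    return reader.read_pandas()


if __name__ == "__main__":
    token = os.environ.get("DREMIO_PAT")  # hypothetical variable name
    if token:
        print(run_query("SELECT 1 AS ok", token))
```

Keeping the token in an environment variable rather than in the notebook itself avoids leaking it when the notebook is shared.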
Dremio Cloud currently exposes 10 connectors to client applications, mostly BI and database client applications and Python code.
Overall, Dremio is very good as a data lakehouse. While I disagree with their marketing that denigrates their competitors, they do have a product that can query large datasets with sub-second responses once you’ve created reflections, at least as long as the engine you’re using is active and hasn’t gone to sleep.
The obvious direct competitor to Dremio is the Databricks Lakehouse Platform. I think both platforms are very good, and I would encourage you to try the free versions of both of them. Using multiple engines on the same data is one of Dremio’s selling points, after all, so why not take advantage of it?
—
Cost: Dremio Cloud Standard edition: free forever. Dremio Cloud Enterprise edition: $0.39/DCU (Dremio consumption unit). Cloud infrastructure cost is not included. Dremio Enterprise Software runs on premises: contact sales for pricing.
Platform: Cloud server runs on AWS; server software runs on AWS, Azure, and Linux; requires Java 8. Supported client browsers include Chrome, Safari, Firefox, and Edge.
Copyright © 2022 IDG Communications, Inc.