Dremio: Simpler and faster data analytics

Built on Apache Arrow and Apache Parquet, Dremio brings self-service to data analysts and SQL queries to NoSQL data sources

Page 2 of 2
{
  "from" : 0,
  "size" : 4000,
  "query" : {
    "bool" : {
      "must" : [ {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "TX",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "UT",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "NM",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "NJ",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "range" : {
          "review_count" : {
            "from" : 100,
            "to" : null,
            "include_lower" : false,
            "include_upper" : true
          }
        }
      } ]
    }
  }
}
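
The request body above has a regular shape: one must_not match clause per excluded state value, plus a range clause on review_count. As a rough illustration of what compiling a SQL predicate into such a request could involve, the following Python sketch rebuilds the same structure from a list of states and a threshold. The function name and its parameters are hypothetical; this is not Dremio's planner code, only a reproduction of the JSON shown here.

```python
import json


def build_pushdown_query(excluded_states, min_review_count, size=4000):
    """Build an Elasticsearch request body shaped like the one above.

    Hypothetical helper for illustration only; it is not part of Dremio
    or of any Elasticsearch client library.
    """
    # One "must_not match" clause per excluded state value.
    clauses = [
        {
            "bool": {
                "must_not": {
                    "match": {"state": {"query": state, "type": "boolean"}}
                }
            }
        }
        for state in excluded_states
    ]
    # Exclusive lower bound and no upper bound: review_count > min_review_count.
    clauses.append(
        {
            "range": {
                "review_count": {
                    "from": min_review_count,
                    "to": None,
                    "include_lower": False,
                    "include_upper": True,
                }
            }
        }
    )
    return {"from": 0, "size": size, "query": {"bool": {"must": clauses}}}


body = build_pushdown_query(["TX", "UT", "NM", "NJ"], 100)
print(json.dumps(body, indent=2))
```

The point of the sketch is only that simple filters map mechanically onto the source's native query language, which is what makes them candidates for push-down.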

There’s really no limit to the SQL that can be executed on Elasticsearch or any supported data source with Dremio. Here is a slightly more complex example that involves a windowing expression:

SELECT
  city,
  name,
  bus_review_count,
  bus_avg_stars,
  city_avg_stars,
  all_avg_stars
FROM (
  SELECT
    city,
    name,
    bus_review_count,
    bus_avg_stars,
    AVG(bus_avg_stars) OVER (PARTITION BY city) AS city_avg_stars,
    AVG(bus_avg_stars) OVER () AS all_avg_stars,
    SUM(bus_review_count) OVER () AS total_reviews
  FROM (
    SELECT
      city,
      name,
      AVG(review.stars) AS bus_avg_stars,
      COUNT(review.review_id) AS bus_review_count
    FROM
      elastic.yelp.business AS business
      LEFT OUTER JOIN elastic.yelp.review AS review ON business.business_id = review.business_id
    GROUP BY
      city, name
  )
)
WHERE bus_review_count > 100
ORDER BY bus_avg_stars DESC, bus_review_count DESC

This query asks how top-rated businesses compare to other businesses in each city. For each business with more than 100 reviews, it compares that business’s average review rating with the average for all businesses in the same city and with the average across all businesses. To perform this query, data from two different datasets in Elasticsearch must be joined together, an operation that Elasticsearch doesn’t support. Parts of the query are compiled into expressions Elasticsearch can process, and the rest of the query is evaluated in Dremio’s distributed SQL execution engine.
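
To make the query's logic concrete, here is a minimal Python sketch of the same steps (join, group, window averages, filter, sort) over a few made-up rows. The data, the function name, and the threshold of 1 are assumptions chosen for illustration; the sketch says nothing about how Dremio actually executes the query.

```python
from collections import defaultdict
from statistics import mean

# Toy rows standing in for the two Elasticsearch datasets (made-up data).
businesses = [
    {"business_id": "b1", "city": "Phoenix", "name": "Cafe A"},
    {"business_id": "b2", "city": "Phoenix", "name": "Diner B"},
    {"business_id": "b3", "city": "Las Vegas", "name": "Grill C"},
]
reviews = [
    {"review_id": "r1", "business_id": "b1", "stars": 5},
    {"review_id": "r2", "business_id": "b1", "stars": 4},
    {"review_id": "r3", "business_id": "b2", "stars": 2},
    {"review_id": "r4", "business_id": "b3", "stars": 3},
    {"review_id": "r5", "business_id": "b3", "stars": 5},
]


def compare_to_city(businesses, reviews, min_reviews):
    # LEFT OUTER JOIN: collect review stars per business_id.
    stars_by_business = defaultdict(list)
    for r in reviews:
        stars_by_business[r["business_id"]].append(r["stars"])

    # GROUP BY city, name: review count and average stars per business.
    grouped = defaultdict(list)
    for b in businesses:
        grouped[(b["city"], b["name"])].extend(
            stars_by_business.get(b["business_id"], [])
        )
    rows = [
        {
            "city": city,
            "name": name,
            "bus_review_count": len(stars),
            "bus_avg_stars": mean(stars) if stars else None,
        }
        for (city, name), stars in grouped.items()
    ]

    # Window averages: per city partition and over all rows.
    # Like SQL AVG, skip businesses whose average is NULL (no reviews).
    rated = [r for r in rows if r["bus_avg_stars"] is not None]
    all_avg = mean([r["bus_avg_stars"] for r in rated]) if rated else None
    by_city = defaultdict(list)
    for r in rated:
        by_city[r["city"]].append(r["bus_avg_stars"])
    for r in rows:
        city_vals = by_city.get(r["city"])
        r["city_avg_stars"] = mean(city_vals) if city_vals else None
        r["all_avg_stars"] = all_avg

    # The outer WHERE and ORDER BY apply after the window values exist.
    # Assumes min_reviews >= 0, so every kept row has a non-NULL average.
    kept = [r for r in rows if r["bus_review_count"] > min_reviews]
    kept.sort(key=lambda r: (-r["bus_avg_stars"], -r["bus_review_count"]))
    return kept


result = compare_to_city(businesses, reviews, min_reviews=1)
for row in result:
    print(row)
```

Note that the window averages are computed before the outer filter runs, so a business that is filtered out for having too few reviews still contributes to its city's average.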

If we were to create a Data Reflection on one of these datasets, Dremio’s query planner would automatically rewrite the query to use the Data Reflection instead of performing this push-down operation. Users wouldn’t need to change their queries or connect to a different physical resource. They would simply experience lower latency, sometimes by as much as 1000x, depending on the source and the complexity of the query.

An open source, industry standard data platform

Analysis and data science are about iterative investigation and exploration of data. Regardless of the complexity and scale of today’s datasets, analysts need to make fast decisions and iterate, without waiting for IT to provide or prepare the data.

To deliver true self-sufficiency, a self-service data fabric should be expected to deliver data faster than the underlying infrastructure. It must understand how to cache various representations of the data in analytically optimized formats and pick the right representations based on freshness expectations and performance requirements. And it must do all of this in a smart way, without relying on explicit knowledge management and sharing.
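
The idea of picking a representation based on freshness and performance can be sketched as a small selection rule. Everything in this Python example is hypothetical (the Representation type, its fields, and the cost numbers); it illustrates the general trade-off and is not a description of Dremio's algorithm.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Representation:
    """A cached copy of a dataset (hypothetical model, not a Dremio type)."""

    name: str
    columns: frozenset
    age_seconds: float      # time since the copy was last refreshed
    estimated_cost: float   # relative cost of scanning this copy


def pick_representation(
    candidates: Sequence[Representation],
    needed_columns: set,
    max_age_seconds: float,
) -> Optional[Representation]:
    """Return the cheapest candidate that covers the query and is fresh enough.

    Returns None when nothing qualifies, meaning the query should be sent
    to the underlying source instead.
    """
    usable = [
        c
        for c in candidates
        if needed_columns <= c.columns and c.age_seconds <= max_age_seconds
    ]
    return min(usable, key=lambda c: c.estimated_cost, default=None)


candidates = [
    Representation("raw_copy", frozenset({"city", "name", "stars"}), 300, 100.0),
    Representation("city_aggregate", frozenset({"city", "stars"}), 3600, 5.0),
]

# A relaxed freshness requirement allows the cheap aggregate to be used.
choice = pick_representation(candidates, {"city", "stars"}, max_age_seconds=7200)
# A strict freshness requirement rules out every cached copy.
fallback = pick_representation(candidates, {"city", "stars"}, max_age_seconds=60)
print(choice.name if choice else None, fallback)
```

In this toy model the caller never names a specific copy; it only states which columns it needs and how stale the data may be, which mirrors the self-service goal described above.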

Data Reflections are a sophisticated way to cache representations of data across many sources, applying multiple techniques to optimize performance and resource consumption. Through Data Reflections, Dremio allows any user’s interaction with any dataset (virtual or physical) to be autonomously routed through sophisticated algorithms.

As the number and variety of data sources in your organization continue to grow, investing in and relying on a new tier in your data stack will become necessary. You will need a solution that has an open source core and is built on industry-standard open source technologies. Dremio provides a powerful execution and persistence layer built upon Apache Arrow, Apache Calcite, and Apache Parquet, three key pillars for the next generation of data platforms.

Jacques Nadeau is the CTO and cofounder of Dremio. Jacques is also the founding PMC chair of the open source Apache Drill project, spearheading the project’s technology and community. Prior to Dremio, he was the architect and engineering manager for Drill and other distributed systems technologies at MapR. In addition, Jacques was CTO and cofounder of YapMap, an enterprise search startup, and held engineering leadership roles at Quigo (AOL), Offermatica (Adobe), and aQuantive (Microsoft).

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.
