Review: OmniSci GPU database lifts huge data sets

GPU acceleration of database, rendering, and visualization enables interactive exploration of data sets with billions of rows

Many of us are awash in data, to the point where conventional databases and conventional BI systems can’t keep up, at least not in real time. There are workarounds, such as sampling the data or working with day-old reports, but each one is a compromise.

OmniSci, formerly called MapD, can keep up with massive amounts of data in real time by using GPUs to accelerate its database, rendering engine, and visualization system. OmniSci has found applications in a number of industries that generate significant amounts of data, including telecom, automotive telematics, oil and gas exploration, defense, and intelligence.

With both mapping and BI capabilities and sub-second response times even with tens of millions of rows, you would expect OmniSci to compete directly with Tableau and Esri. But in fact OmniSci makes a big deal about how it can be used to accelerate both Tableau and Esri.

According to the company, OmniSci will be integrated with machine learning capabilities and become more interesting to data scientists in the next year. That makes technological sense, since the product already depends on CUDA and Nvidia GPUs, and since Nvidia has developed the necessary GPU-accelerated machine learning and deep learning libraries. I’m not clear, however, on how that will work from the viewpoint of a user.

Alternatives to OmniSci as a GPU-accelerated database analytics platform include Brytlyt, SQream DB, BlazingSQL, and Kinetica.

OmniSci features and architecture

As shown in the diagram below, OmniSci has multiple components. The three major components are the core database engine, the rendering engine, and the data visualization interface.

OmniSci Core is an open-source GPU-accelerated SQL relational database server engine with strong GIS (geospatial) support and some data science capabilities. The SQL dialect supported is called OmniSQL, and it appears similar to MySQL and PostgreSQL for the most part. For example, OmniSQL uses a LIMIT clause (MySQL and PostgreSQL) to truncate SELECT query result sets rather than a TOP (SQL Server) or ROWNUM (Oracle) clause. The geospatial support uses Open Geospatial Consortium (OGC) types.
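To make the LIMIT point concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in engine, since SQLite also accepts a LIMIT clause; the trees table and its columns are hypothetical, and nothing here talks to OmniSci.

```python
import sqlite3

# Stand-in engine only: SQLite, like MySQL and PostgreSQL, truncates with LIMIT.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trees (tree_id INTEGER, species TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO trees VALUES (?, ?)",
    [
        (1, "Callery pear"),
        (2, "London planetree"),
        (3, "Honeylocust"),
        (4, "Pin oak"),
    ],
)

# LIMIT-style truncation, the form the review describes for OmniSci's dialect.
limit_query = "SELECT tree_id, species FROM trees ORDER BY tree_id LIMIT 2"
first_two = conn.execute(limit_query).fetchall()
print(first_two)

# For comparison only (never executed here): the SQL Server spelling uses TOP.
top_query = "SELECT TOP 2 tree_id, species FROM trees ORDER BY tree_id"
```

The TOP variant is kept as a string so the contrast is visible; running it against a LIMIT-based engine would be a syntax error.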

The key differentiator for OmniSci Core is its ability to return results in milliseconds, even on tables with billions of rows. Of course, you need lots of RAM and especially lots of GPU VRAM to get performance like that. Specifically, plan on 2 GB of GPU RAM for every 30 million rows; capacity scales linearly with GPU RAM.

OmniSci Render is a GPU-accelerated graph server, which takes the output of SQL queries against OmniSci Core and uses it to generate charts such as point maps, choropleth maps, and scatterplots. Render uses Vega Visualization Grammar specifications to define the output, which it creates as a PNG image. The PNG image is then sent over the wire to Immerse. Sending a single image is much faster and more efficient than rendering millions of points on the client.

OmniSci Immerse is a web-based data visualization interface. Its user interface for defining charts is very similar to BI tools such as Qlik and Tableau. Immerse charts combine into dashboards, and the user can cross-filter the charts on a dashboard, for example by selecting an item on a pie chart or by zooming into a point map. I’ll offer a few examples of this when I discuss some of the OmniSci demos.
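Conceptually, a cross-filter is just an extra predicate applied to every chart's query. The sketch below illustrates that idea only; it is not Immerse's actual query builder. The build_query helper, the trees table, and its column names are all hypothetical, and sqlite3 stands in for the database.

```python
import sqlite3


def build_query(table, filters, bbox=None):
    """Compose a parameterized COUNT query from dashboard-style selections.

    Hypothetical helper: `filters` maps column names to selected values, and
    `bbox` is (min_lon, min_lat, max_lon, max_lat) for a zoomed map view.
    Table and column names are interpolated, so they must come from trusted
    code, never from user input.
    """
    clauses, params = [], []
    for column, value in filters.items():
        clauses.append(f"{column} = ?")
        params.append(value)
    if bbox is not None:
        min_lon, min_lat, max_lon, max_lat = bbox
        clauses.append("longitude BETWEEN ? AND ?")
        clauses.append("latitude BETWEEN ? AND ?")
        params.extend([min_lon, max_lon, min_lat, max_lat])
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    return f"SELECT COUNT(*) FROM {table}{where}", params


# Tiny made-up data set standing in for a tree census table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trees (tree_id INTEGER, health TEXT, longitude REAL, latitude REAL)"
)
conn.executemany(
    "INSERT INTO trees VALUES (?, ?, ?, ?)",
    [
        (1, "Poor", -73.954, 40.764),
        (2, "Good", -73.955, 40.765),
        (3, "Poor", -73.990, 40.750),
        (4, "Poor", -73.953, 40.766),
    ],
)

# Selecting "Poor" on a health chart, then zooming the map to a small area.
sql, params = build_query("trees", {"health": "Poor"}, bbox=(-73.96, 40.76, -73.95, 40.77))
poor_in_view = conn.execute(sql, params).fetchone()[0]

# The same selection with no map zoom applied.
sql_all, params_all = build_query("trees", {"health": "Poor"})
poor_total = conn.execute(sql_all, params_all).fetchone()[0]
print(sql, poor_in_view, poor_total)
```

Because every chart on a dashboard shares the same predicates, one selection changes the result of every query, which is why a single click updates all the charts at once.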

[Figure: OmniSci architecture diagram. Credit: OmniSci]

This diagram shows the high-level architecture of the OmniSci Platform. The core database SQL engine is open source.

OmniSci SKUs

OmniSci is available in enterprise, cloud, and open source versions. The enterprise version can be configured for high availability. The open source version is just the OmniSci Core database.

You can run the free open source OmniSci Core SQL database on-premises or in the cloud. If you want good performance, run it with Nvidia GPUs. Allow 1 GB of GPU memory for each 15 million rows of data that you want to analyze.
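The sizing rule above is simple arithmetic, sketched here in Python. The function names are my own, and the linear scaling is the review's rule of thumb rather than a guarantee: actual memory use depends on how many columns a query touches and how wide they are.

```python
ROWS_PER_GB = 15_000_000  # rule of thumb quoted in the review


def gpu_memory_gb(rows):
    """Estimate GPU memory in GB for a row count, assuming linear scaling."""
    return rows / ROWS_PER_GB


def rows_for_gpu_memory(gb):
    """Estimate how many rows fit in a given amount of GPU memory."""
    return int(gb * ROWS_PER_GB)


# A 2 GB allotment, such as the cloud trial plan described in this review.
trial_rows = rows_for_gpu_memory(2)

# The 1.2-billion-row taxi data set used in one of the shared demos.
taxi_gb = gpu_memory_gb(1_200_000_000)

# The 683,788-row NYC Tree Census data set.
tree_gb = gpu_memory_gb(683_788)

print(trial_rows, taxi_gb, round(tree_gb, 3))
```

By this estimate, 2 GB covers about 30 million rows, while a 1.2-billion-row table calls for roughly 80 GB of GPU memory, which in practice means several GPUs.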

If you want the full benefit of OmniSci, including the GPU-boosted rendering engine and the Immerse web UI, consider either the Enterprise version or OmniSci Cloud. If you want to run on-premises, the Enterprise version is what you need. Either OmniSci Enterprise running in one of the big three public clouds, or the OmniSci Cloud, will give you a browser-based and cloud-based system.

[Figure: OmniSci editions feature comparison. Credit: OmniSci]


OmniSci Cloud and demos

I signed up for a free 14-day trial of the OmniSci Cloud, on a plan that comes with access to 2 GB of GPU memory. Cloud trials have three dashboards pre-installed: NYC Tree Census 2015, NYC Taxi Rides, and Flights Demo. I explored these and several of the shared standalone demos, which have more rows and run on larger instances.

All of these demos run on flattened data sets. While OmniSci supports JOINs and VIEWs, using them does add some overhead.

The NYC Tree Census demo dashboard reflects the tree population of New York City in 2015 and has 683,788 rows. When exploring this relatively small data set, I experienced consistent sub-second response.

[Screenshot: NYC Tree Census dashboard with popup. Credit: IDG]

The NYC Tree Census dashboard comes with a map showing tree locations and type, a donut chart of tree health, a bar chart of tree species counts, and a histogram of tree diameters.

[Screenshot: NYC Tree Census, trees in poor health. Credit: IDG]

Suppose a crew wanted to further assess the trees in poor health near Presbyterian Hospital in NYC. In the screenshot above, I zoomed in on the area, cross-filtered to select trees in poor health, and lassoed the area of interest.

[Screenshot: NYC Tree Census, Callery pear trees. Credit: IDG]

Imagine that a TV producer was looking for a street lined with Callery pear trees on the Upper East Side of Manhattan to shoot an outdoor scene. In the screenshot above I’ve cross-filtered for Callery pear trees and zoomed in on the East Side. East 79th Street between 1st Avenue and 2nd Avenue looks promising.

The NYC Taxi Rides dashboard shows 13 million rides taken in December 2015. I got sub-second response from this dashboard as I was exploring it.

[Screenshot: NYC Taxi Rides, drop-offs near the hospital. Credit: IDG]

Zooming into the area of Presbyterian Hospital shows heavy taxi drop-offs along the drive to the main entrance as well as along York Avenue. Most of the taxi rides originated in Manhattan and were less than five miles.

The shared taxi tipping demo uses seven years of the NYC Taxi Rides data and joins the ride table to a building data set, with the nearest building to each drop-off and pickup location stored in the table. This data set has 1.2 billion records, which is significant even for OmniSci. At times when I zoomed and panned the map, the background took several seconds to fill in, and applying cross-filtering that affected hundreds of millions of rows also caused a multi-second refresh.

[Screenshot: NYC taxi tipping demo. Credit: IDG]

Some quick observations from the taxi demo: Average tip percentages have been going up since the introduction of credit card readers in cabs; tips are highest during the morning rush hour and late at night; the average tip for a pickup at the Metropolitan Museum of Art is about six percent.

There’s a small slice (seven million records from 2008) of the US Flights data set in a pre-installed dashboard, but the full data set (176 million flight records) is available in a shared demo. The charts mostly updated in two to three seconds as I explored the full data set.

[Screenshot: US Flights demo. Credit: IDG]

The US Flights demo contains 176 million flight records from 1987 to 2017. Note the big dip in the number of flights after September 11, 2001. There are more insights on the Flights data set in this blog post.

Designing OmniSci charts

As you can see in the screenshot below, OmniSci supports 16 chart types. Each kind of chart has its own designer; the one below is for point maps. This example is from the NYC Tree Census dashboard.

[Screenshot: OmniSci point map chart designer. Credit: IDG]

Anyone familiar with designing a chart in Tableau or another BI system will find designing charts in OmniSci easy to learn.

OmniSci interfaces and APIs

The OmniSci database supports ODBC and JDBC (including RJDBC) connectors. It also supports SQL queries from the Immerse command line. When you use Immerse graphically, it generates SQL queries under the covers. You can see the SQL queries as they happen by opening a JavaScript console in your browser, typing SQLLogging(true), and pressing Enter.

There are two APIs to connect to OmniSci from Python. Pymapd implements a Python DB API 2.0-compliant interface and returns results in the Apache Arrow-based GDF (GPU Data Frame) format for efficient data interchange. JayDeBeApi provides an interface to the JDBC connector from Python; the query results can be used to instantiate a Pandas DataFrame, from which you can analyze and plot the data.
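Because pymapd follows the Python DB API 2.0, code written against that interface should carry over with little change. The sketch below is a minimal illustration under that assumption: fetch_records is a hypothetical helper, the flights table is made-up data, and sqlite3 stands in for a live OmniSci server so the example runs anywhere.

```python
import sqlite3


def fetch_records(conn, sql):
    """Run a query on any DB API 2.0 connection and return a list of dicts."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        # cursor.description holds one 7-item sequence per result column;
        # the first item is the column name.
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
    finally:
        cursor.close()


# Stand-in connection. With pymapd installed and a server available, the
# assumption is that you would pass the result of pymapd.connect(...) instead,
# supplying your own user, password, host, and dbname.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (carrier TEXT, dep_delay INTEGER)")  # hypothetical
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("AA", 10), ("AA", 20), ("DL", 5)],
)

records = fetch_records(
    conn,
    "SELECT carrier, AVG(dep_delay) AS avg_delay "
    "FROM flights GROUP BY carrier ORDER BY carrier",
)
print(records)
```

A list of dicts like this can be handed straight to the pandas DataFrame constructor, which is the same end state the JayDeBeApi route reaches.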

If you have tables with billions of rows that you need to explore interactively without downsampling, OmniSci’s GPU-accelerated analytics platform is just what you need. Being able to dive into a data set of that size, plotting the results as you go with response times that are less than three seconds, is a liberating experience for a data analyst.

Similarly, if your data constantly streams into your database, OmniSci can give you a good compromise between trying to analyze the stream live and analyzing day-old snapshots, by letting you refresh your data sets. You can refresh from the Immerse dashboard manually (using the Immerse refresh icon, not the browser refresh key), or automatically at intervals.

While OmniSci isn’t the only GPU-accelerated database and analysis platform, it’s certainly a good one. Whether it fits into your digital estate depends on what else you are using, how much data you have, and whether you need to explore your data in real time.

Cost: OmniSci open source: Free. OmniSci Cloud: $95 to $2,050 per month after 14-day free trial. OmniSci Enterprise: Contact sales; free trial.

Platform: CentOS/RHEL, Ubuntu, Arch Linux, macOS. CUDA required to use GPUs. OmniSci Cloud requires only a browser.

At a Glance
  • OmniSci, formerly called MapD, can keep up with massive amounts of data in real time by using GPUs to accelerate its database, rendering engine, and visualization system.

    Pros

    • Able to handle tables with billions of rows
    • GPU acceleration allows for response times in seconds
    • Includes GPU-accelerated chart generation
    • Handles geographic information
    • Core database is open source

    Cons

    • Graphics system is proprietary
    • Requires Nvidia GPUs for short response times

Copyright © 2019 IDG Communications, Inc.