NoSQL operational databases are rapidly gaining adoption. They're easier to use, operate, and scale than their relational counterparts, and they enable faster development of richer applications. Increasingly, they power modern Web, mobile, and IoT applications.
Already, NoSQL data is the fabric that connects the world's more than 200,000 APIs (a number that continues to rise), each of which speaks JSON or XML. Within the next five years, the vast majority of the data in the world will be semi-structured. Stated differently, the tabular data that powers today's relational analytics is going to be an insignificant fraction of the total data in the world.
In this brave new world, businesses need a way to derive analytic value from their NoSQL data -- to help them see and understand the data in a way that drives business value. How will you choose a NoSQL analytics system? Here are some defining characteristics to aid you in your search. (For more detail, see my whitepaper.)
A brief history
Historically, NoSQL analytics has been a low-level coding affair, with occasional reprieves for specific data models (such as XQuery for lightweight analytics over XML data). Complex ETL processes have often been built to extract a flat subset of the data and relocate it to relational systems, allowing companies to leverage legacy analytics tooling.
Recently, however, the industry has seen three distinct approaches to the problem of NoSQL analytics, each designed to eliminate the need for unnecessary coding and ETL:
- Relational extensions. Existing relational systems are retrofitted with semi-structured data types and functions to probe them. For example, Postgres, Presto, and SparkSQL all have JSON accessor functions. These systems still have a relational core, however, and the limitations that come along with that.
- Virtualization drivers. These drivers take NoSQL data and expose it as a collection of virtual tables in a way that's compatible with existing analytics and BI tooling. Due to the impedance mismatch between relational and NoSQL data models, they frequently discard important information, which results in a loss of analytic flexibility.
- Post-relational analytics. These are novel systems designed specifically for analytics over semi-structured data. Specifications include JSONiq (XQuery for JSON), SQL++, and MRA, while analytics systems include Apache Drill and Quasar.
Each approach has its advantages and disadvantages, which will become more apparent in my discussion of the capabilities required for general-purpose NoSQL analytics below.
Eight essential characteristics
With the growing number of choices for NoSQL analytics, it's more important than ever to have a way to judge the suitability of any system for general-purpose NoSQL analytics.
Toward that end, I have identified eight characteristics that a system must have in order to handle routine analytic use cases over various NoSQL data models. A system is a good fit for general-purpose NoSQL analytics precisely to the degree it satisfies these characteristics.
1. Generic data model
NoSQL analytics systems must possess a data model generic enough to abstract over the differences between common NoSQL data models. In principle, there are infinitely many NoSQL data models, but in practice, a general-purpose NoSQL analytics system should at least support the following:
- Heterogeneous containers, including sets, maps, and arrays
- A reference type capable of representing edges and foreign keys
- Common atomic types, including booleans, numbers, and dates
An abstraction similar to the above is sufficient to losslessly represent the majority of NoSQL data models, including those utilized by JSON, XML, MongoDB, Elasticsearch, MarkLogic, CouchDB, Cassandra, Neo4j, Aerospike, and many others.
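As a rough illustration, the three requirements above could be modeled in Python along the following lines. The names `Ref`, `Value`, and `describe` are hypothetical and exist only for this sketch; they are not taken from any particular analytics system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Union


@dataclass(frozen=True)
class Ref:
    """Hypothetical reference type: points at another value by collection and key."""
    collection: str
    key: str


# Atomic types, a reference type, and heterogeneous containers.
Atom = Union[bool, int, float, str, date, None]
Value = Union[Atom, Ref, list, dict, frozenset]


def describe(value: Value) -> str:
    """Return a coarse label for where a value sits in the model."""
    if isinstance(value, Ref):
        return "reference"
    if isinstance(value, dict):
        return "map"
    if isinstance(value, list):
        return "array"
    if isinstance(value, (set, frozenset)):
        return "set"
    return "atom"


order = {
    "id": 42,
    "placed": date(2016, 3, 1),
    "customer": Ref("customers", "c-17"),                   # foreign key or graph edge
    "items": [{"sku": "A1", "qty": 2}, "gift-wrap", 9.99],  # heterogeneous array
    "tags": frozenset({"rush", "web"}),
}

labels = {key: describe(val) for key, val in order.items()}
```

Note that the containers place no restriction on what they hold: the `items` array mixes a map, a string, and a number, which is exactly the kind of value a fixed-column table cannot represent directly.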
2. Isomorphic data model
NoSQL analytics systems must represent data losslessly. That is, they may not use a data model that discards information present in the original data.
Discarding information present in the original data makes certain kinds of analytic questions impossible to answer. The more information that is thrown away, the greater the analytic impact.
This characteristic allows businesses to extract maximum value from their NoSQL data. Systems that do not possess this characteristic -- including the variety of JDBC/ODBC drivers on the market -- invariably fail with common analytic scenarios on NoSQL data.
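To make the cost of a lossy mapping concrete, here is a small sketch. It assumes a hypothetical `flatten_to_row` helper that imitates how a virtual-table layer might squeeze documents into fixed scalar columns; it is not modeled on any specific driver.

```python
import json

doc = {
    "user": "ann",
    "score": 10,                                   # a number here...
    "visits": [{"page": "/a"}, {"page": "/b"}],
}
other = {"user": "bob", "score": "n/a", "visits": []}  # ...a string here


def flatten_to_row(d, columns):
    """Hypothetical lossy mapping: keep scalar columns only, coerced to text."""
    row = {}
    for col in columns:
        val = d.get(col)
        if val is None or isinstance(val, (list, dict)):
            row[col] = None          # nested structure is dropped entirely
        else:
            row[col] = str(val)      # the original type is erased
    return row


def roundtrip(d):
    """Lossless mapping: serialize and parse back, changing nothing."""
    return json.loads(json.dumps(d))


rows = [flatten_to_row(d, ["user", "score", "visits"]) for d in (doc, other)]
```

After flattening, a question such as "how many pages did each user visit?" can no longer be answered from `rows`, and the numeric `10` is indistinguishable from a string `"10"`. The round-tripped document still supports both questions.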
3. Multidimensional
NoSQL analytics systems must allow analytic operations on nested dimensions of data.
Most NoSQL data is heavily nested ("de-normalized"). For example, a document might contain a list of documents, which in turn contain lists of other documents. In order to answer arbitrary analytical questions on the data, a system must allow the slicing, dicing, and aggregation of these nested dimensions of data.
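A minimal sketch of aggregating across nested dimensions, written in plain Python. The order/shipment/item shape and the `qty_by` helper are assumptions made up for this example.

```python
from collections import defaultdict

orders = [
    {"customer": "ann", "shipments": [
        {"carrier": "ups", "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
        {"carrier": "fedex", "items": [{"sku": "A", "qty": 5}]},
    ]},
    {"customer": "bob", "shipments": [
        {"carrier": "ups", "items": [{"sku": "B", "qty": 4}]},
    ]},
]


def qty_by(orders, dimension):
    """Sum item quantities grouped by a dimension taken from any nesting level."""
    totals = defaultdict(int)
    for order in orders:
        for shipment in order["shipments"]:
            for item in shipment["items"]:
                # Each item carries the context of every level above it.
                context = {
                    "customer": order["customer"],
                    "carrier": shipment["carrier"],
                    "sku": item["sku"],
                }
                totals[context[dimension]] += item["qty"]
    return dict(totals)


by_carrier = qty_by(orders, "carrier")    # group by the middle level
by_sku = qty_by(orders, "sku")            # group by the innermost level
by_customer = qty_by(orders, "customer")  # group by the outermost level
```

The same measure (`qty`, stored three levels deep) is sliced by a dimension from each level of nesting without first flattening the documents into separate tables.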
4. Unified schema/data
NoSQL analytics systems must unify analytic operations on both "schema" and data.
In semi-structured data, the schema is simply data (keys in a map for JSON data, tags for XML, and so on). Sometimes, a NoSQL schema encodes what a relational system would consider schema. But other times, it encodes data (for example, when a map or document is used to build a real-time histogram in a NoSQL database).
NoSQL analytics systems must offer the same capabilities on schema as on data.
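The histogram case can be sketched as follows. The document shape and the `entries` helper are hypothetical; the point is only that map keys are turned into ordinary values, so they can be filtered and aggregated like any other data.

```python
doc = {"page": "/home", "hits_by_hour": {"09": 12, "10": 30, "14": 7}}


def entries(mapping):
    """Expose a map's keys as ordinary values alongside what they point to."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


rows = entries(doc["hits_by_hour"])

# Filter on the keys (the "schema"), then aggregate the values.
morning_hits = sum(r["value"] for r in rows if int(r["key"]) < 12)

# Aggregate on the values, then return a key as the answer.
busiest_hour = max(rows, key=lambda r: r["value"])["key"]

# The same operation applied where the keys really are schema.
field_names = sorted(r["key"] for r in entries(doc))
```

Here the keys of `hits_by_hour` are data (hours of the day), while the keys of `doc` are what a relational system would call column names. One operation serves both.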
5. Post-relational
NoSQL analytics systems must be strictly more expressive than relational systems.
Even NoSQL data needs to be joined with other data sets (as well as filtered, sorted, and aggregated). Thus, NoSQL analytics systems must have all of the same capabilities as their relational cousins. In other words, they must be post-relational rather than non-relational.
6. Polymorphic queries
NoSQL analytics systems must support queries across the common elements of structurally heterogeneous values.
There are numerous analytic scenarios where some structure is known and relatively stable across a set of values, but other parts are unknown and vary (for example, clickstream data or product catalog data). These cases demand the ability to perform analytics across the common elements of a set of values that otherwise exhibit arbitrary heterogeneity.
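As a sketch of the idea, the following Python queries the common fields of a clickstream whose events otherwise differ in shape. The event layout and the `project` helper are illustrative assumptions.

```python
from collections import Counter

events = [
    {"type": "click", "ts": 1, "target": {"id": "buy", "x": 10, "y": 20}},
    {"type": "view", "ts": 2, "url": "/pricing"},
    {"type": "click", "ts": 3, "target": "nav"},   # same field, different shape
    {"ts": 4, "payload": [1, 2, 3]},               # no "type" field at all
]


def project(values, field):
    """Yield `field` from every value that has it, silently skipping the rest."""
    for value in values:
        if isinstance(value, dict) and field in value:
            yield value[field]


counts_by_type = Counter(project(events, "type"))
latest_ts = max(project(events, "ts"))
```

The query never needs to know that `target` is sometimes a map and sometimes a string, or that one event lacks `type`. It touches only the structure it asks about.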
7. Dynamic type discovery and conversion
NoSQL analytics systems must support runtime type identification and conversion so that custom business logic can be used to dictate analytic treatment of variation.
All relational systems support conversion of atomic values (for example, converting a string to a number), but not type identification, because the structure is constant across a column. Because NoSQL data models are richer and data structure is dynamic, NoSQL analytics systems must support type identification as well as far richer conversions than would be necessary in a relational system.
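A minimal sketch of runtime type identification driving a conversion rule. Both `type_of` and `to_amount` are hypothetical helpers, and the rule itself (strip currency symbols, unwrap an `amount` field, sum arrays) stands in for whatever business logic a real deployment would need.

```python
def type_of(value):
    """Identify the runtime type of a value using model-level names."""
    if isinstance(value, bool):          # checked first: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "map"
    if value is None:
        return "null"
    return "unknown"


def to_amount(value):
    """Hypothetical business rule: coerce differently shaped values to a float."""
    kind = type_of(value)
    if kind == "number":
        return float(value)
    if kind == "string":
        try:
            return float(value.replace("$", "").replace(",", ""))
        except ValueError:
            return None                  # not a numeric string
    if kind == "map" and "amount" in value:
        return to_amount(value["amount"])
    if kind == "array":
        parts = [to_amount(v) for v in value]
        return sum(p for p in parts if p is not None)
    return None                          # booleans, nulls, and anything else


prices = [19.5, "$1,200", {"amount": "7", "currency": "USD"}, [1, "2"], True, None, "n/a"]
amounts = [to_amount(p) for p in prices]
```

The conversion is only possible because the type of each value is inspected first. A single column-level cast, as in a relational system, would have no way to treat the map and the array differently from the string.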
8. Structural patterns
NoSQL analytics systems must support structural pattern matching that is capable of filtering and extracting from variable-length, multidimensional patterns.
Numerous common analytic scenarios on NoSQL data (including event-oriented data, document data, and XML and HTML content) cannot be satisfied without an ability to search for structural patterns in the data.
While most relational systems have a special-purpose function for pattern matching on events, the NoSQL analog of such functionality must be much more sophisticated due to the increased richness of the data model.
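One simple form of structural pattern matching can be sketched as a path pattern over nested data. The pattern syntax here is an assumption invented for this example: `"*"` matches exactly one step and `"**"` matches zero or more steps. Real systems would offer far richer patterns than this.

```python
def walk(value):
    """Yield (path, node) for every node in a nested structure, root first."""
    yield (), value
    if isinstance(value, dict):
        children = value.items()
    elif isinstance(value, list):
        children = enumerate(value)
    else:
        return
    for key, child in children:
        for path, node in walk(child):
            yield (key,) + path, node


def matches(path, pattern):
    """Return True if the whole path matches the whole pattern."""
    if not pattern:
        return not path
    head, rest = pattern[0], pattern[1:]
    if head == "**":
        # Let "**" absorb any number of leading steps, including none.
        return any(matches(path[i:], rest) for i in range(len(path) + 1))
    return bool(path) and head in ("*", path[0]) and matches(path[1:], rest)


def find(value, pattern):
    """Extract every node whose path from the root matches the pattern."""
    return [node for path, node in walk(value) if matches(path, pattern)]


page = {
    "tag": "div",
    "children": [
        {"tag": "a", "href": "/one"},
        {"tag": "p", "children": [{"tag": "a", "href": "/two"}]},
    ],
}

hrefs = find(page, ("**", "href"))                     # links at any depth
top_level_tags = find(page, ("children", "*", "tag"))  # tags of direct children only
```

The `"**"` step is what handles variable-length structure: the two links sit at different depths, yet one pattern extracts both.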
Up from SQL
With the growth of NoSQL data, the question of how to derive analytic value from this data has never been more important to companies embracing modern data technologies.
Coding and ETL have provided stopgap solutions, but recent approaches provide first-class solutions to the problem. These include extensions to legacy relational systems, drivers that provide virtual relational models on top of NoSQL data, and "NoSQL first" analytics systems.
In evaluating NoSQL analytics systems, eight characteristics stand out. A system must have all eight in order to be able to handle common analytic scenarios on arbitrary NoSQL data.
Together, these characteristics define what it means for a system to support generalized NoSQL analytics. In addition, they can help guide those of you who are looking for a way to objectively evaluate competing NoSQL analytics systems.
John A. De Goes is Chief Technology Officer at SlamData, a company building open source technology for NoSQL analytics. A prolific speaker, author, and open source contributor, John is a recognized expert in the field of big data analytics. Readers interested in the whitepaper on which this article is based can download it here.
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to firstname.lastname@example.org.