AnzoGraph: A graph database for deep analytics

AnzoGraph is a fast, horizontally scalable, OLAP graph database that brings a wealth of analytics capabilities to large graphs

Graph databases such as Neo4j, TigerGraph, Amazon Neptune, the graph portion of Azure Cosmos DB, and AnzoGraph, the subject of this review, offer a natural representation of data that is primarily about the relationships between people, places, and things. Graph databases are good for applications such as fraud detection, social networks, recommendation systems, and so on.

This essay makes a good case for graph databases over relational databases for these kinds of apps. TL;DR version: Once you need complex joins of large tables, relational database queries slow down; the same task is faster on a graph database.

Like relational databases, graph databases can be designed for efficient online transaction processing (OLTP) or efficient online analytical processing (OLAP), and occasionally for both (HTAP, hybrid transaction/analytical processing). Neo4j, Neptune, and Cosmos DB are all OLTP graph databases, although Neo4j has recently added some OLAP capabilities. TigerGraph is an HTAP graph database and claims swift, deep analytics as well as fast transaction processing.

AnzoGraph, on the other hand, is designed as an OLAP graph database. Cambridge Semantics actually says “Complement your OLTP graph database engine with OLAP” on the main web page for AnzoGraph.

Neo4j uses its own query language, Cypher, for its labeled property graphs; there is an open source version, openCypher. TigerGraph uses its own query language, GSQL. Neptune has both RDF (SPARQL) and labeled property graph (Gremlin) graph stores. They both exist on the same fabric, but they don’t connect to each other. Cosmos DB’s graph database uses Gremlin, which is the graph traversal language of Apache TinkerPop.

AnzoGraph uses W3C-standard RDF triple and quad data and SPARQL 1.1 queries. It also supports labeled property graphs as part of the RDF store, conforming to the proposed RDF* and SPARQL* standards. AnzoGraph has extensions to SPARQL to support graph algorithms, inferencing, window aggregates, BI functions, and named views. Support for openCypher and Bolt (the Neo4j protocol) is planned.
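
As a rough illustration of what the proposed RDF* and SPARQL* extensions add, the sketch below attaches a property directly to an edge and then queries it. The :alice, :knows, and :since terms are hypothetical examples for illustration, not part of AnzoGraph's sample data.

# A property attached directly to an edge, in Turtle* syntax:
#   << :alice :knows :bob >> :since 2010 .
# A SPARQL* query that reads that edge property back:
PREFIX : <http://example.org/>
SELECT ?person ?since
WHERE {
  << :alice :knows ?person >> :since ?since .
}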

AnzoGraph architecture

As you can see in the figure below, AnzoGraph is a massively parallel in-memory graph database that works with enterprise data sources, does parallel data loads of RDF and CSV formats, and provides BI analytics, graph algorithms, inferencing, data science functions, and user-defined functions. It works with Python programs, Apache Zeppelin notebooks, and Jupyter notebooks, as well as with third-party clients such as KeyLines and Graphileon. AnzoGraph can be run stand-alone or inside Anzo, Cambridge Semantics’ data discovery and integration platform.

If you are writing Python against AnzoGraph — whether in a program or in a notebook — you can call a Python SPARQL client to make queries. In Zeppelin notebooks you can also write SPARQL code inside a cell with a %sparql directive at the top, and pass the results to Python in subsequent cells, for graphing and analysis.
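
As a minimal sketch of that Zeppelin flow, a paragraph like the one below runs a query directly against the connected AnzoGraph instance. Only the %sparql line is Zeppelin-specific; the query itself is an arbitrary example, not taken from the product tutorials.

%sparql
# Peek at a handful of triples to confirm the connection works.
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10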

AnzoGraph reference architecture diagram. (Image: Cambridge Semantics)

AnzoGraph features, benefits, and applications

AnzoGraph features high-performance graph query execution and scalability to billions and even trillions of triples, as well as fast parallel data loads that don’t require taking the database offline. The AnzoGraph analytics story is that it encompasses all kinds of analytics: graph algorithms, graph views, named queries, aggregates, data science functions, and data warehouse-style BI and reporting.
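
Much of that reporting story rests on standard SPARQL 1.1 grouping and aggregation. The query below is a generic sketch of that pattern; the <category> and <amount> predicates are made up for illustration, and AnzoGraph's window aggregates, named views, and graph algorithms are extensions documented in its SPARQL reference.

# Generic BI-style aggregate: listing count and average amount per category,
# using hypothetical <category> and <amount> predicates.
SELECT ?category (COUNT(?sale) AS ?num_sales) (AVG(?amount) AS ?avg_amount)
WHERE {
  ?sale <category> ?category .
  ?sale <amount> ?amount .
}
GROUP BY ?category
HAVING (COUNT(?sale) >= 10)
ORDER BY DESC(?avg_amount)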

The company claims that AnzoGraph is priced for affordability and scale, but refuses to provide actual pricing without an NDA. The pricing is based on the number of nodes, but there’s no way for a reviewer to confirm or deny the claim of affordability.

AnzoGraph has reference customers in financial services (PwC) and pharmaceuticals (Eli Lilly and Merck). These and other customers are using AnzoGraph for scientific data discovery, anti-fraud and anti-money laundering, and a 360-degree view of the customer.

AnzoGraph deployment options

The documentation offers instructions for setting up three types of AnzoGraph sandbox and three types of full deployment. The three deployment options are on AWS CloudFormation, Docker/Kubernetes, and RHEL/CentOS. Google Cloud Platform and Azure deployments are usually treated as Kubernetes deployments.

The sandbox deployments are single nodes with minimal barriers to use by a developer. The full deployments are managed clusters with network isolation and security.

The AWS CloudFormation templates offer multiple deployment scenarios that correspond with commonly used AWS network and environment configurations. One has restricted access, as shown in the diagram below. A second has intranet integration. And a third uses AWS PrivateLink to support multiple availability zones.

AnzoGraph can be deployed on CentOS, Kubernetes, and AWS. This diagram shows a simple deployment on AWS with restricted access. (Diagram: Cambridge Semantics)

Installing AnzoGraph on Docker

I chose to create an AnzoGraph sandbox on Docker CE, using an iMac with 16 GB of RAM and an Intel Core i7 processor. Before the installation, I upgraded Docker to the latest build, installed Kitematic, and allocated four CPUs, 12 GiB of RAM, and 3.5 GiB of swap space to Docker. The screenshot below shows AnzoGraph booting up in Kitematic.

AnzoGraph startup in Docker Kitematic. Note that the logs mention “jetty,” which I assume is Eclipse Jetty, a web server and javax.servlet container that is often used for machine-to-machine communications. (Screenshot: IDG)

For installation and initial data loading and queries, I followed the First Five Minutes documentation for Docker deployments. This involves loading the Tickit sample data that ships with the product and running some SPARQL queries against it. More queries appear in the Working with the Tickit Data topic.
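
The load step boils down to a SPARQL LOAD statement along the lines of the sketch below. The directory URI scheme and path here are assumptions on my part, so use the exact command given in the First Five Minutes guide for your deployment.

# Load the bundled Tickit sample files into the <tickit> named graph.
# The dir: URI and path below are placeholders, not the literal product paths.
LOAD <dir:/opt/anzograph/sampledata/tickit> INTO GRAPH <tickit>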

AnzoGraph SPARQL examples

The example SPARQL queries start simple and advance to rather deep fraud-detection queries. The first query just counts the triples:

The AnzoGraph query console allows us to run SPARQL queries against the graph database. Here we are simply counting the triples in the Tickit graph. (There are 5.5 million.) (Screenshot: IDG)
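
The count itself is a one-line aggregate; a sketch of the query looks like this (the variable name is mine):

# Count every triple in the Tickit named graph.
SELECT (COUNT(*) AS ?number_of_triples)
FROM <tickit>
WHERE { ?s ?p ?o }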

As the sequence progresses, we zero in on identity thieves and scalpers. The query below shows that Axel Dominguez scalped tickets for $2,500 when the average ticket price was around $400.

This SPARQL query against the Tickit graph lists the sellers who charged above-average prices for tickets to given events. (Screenshot: IDG)

The queries make more sense if you refer to the model for the graph, shown below. The diagram did not come out of AnzoGraph’s own tools; Cambridge Semantics is working with partners to add that capability.

Ontology (model) for the Tickit graph data set. Circles are nodes, arrows are edges, and rectangles are properties. (Diagram: IDG)

The full query to find ticket scalpers follows. The subquery to find the average ticket price per event runs first, and feeds into the results of the outer SELECT statement. The FILTER clause restricts the output to scalpers who sell for more than the average price, and the ORDER BY desc(?priceperticket) clause shows the highest selling price first. This query ran in 16.5 seconds, according to the admin query log.

# Find sellers whose asking price for an event exceeds that event's average
# price, highest asking price first.
SELECT ?sellername ?avg_price ?priceperticket ?eventname ?listtime
FROM <tickit>
WHERE {
  # Subquery: compute the average ticket price for each event.
  { SELECT ?eventname (avg(?priceperticket) AS ?avg_price)
    WHERE {
      ?listing <eventid> ?eventid .
      ?eventid <eventname> ?eventname .
      ?listing <priceperticket> ?priceperticket .
    }
    GROUP BY ?eventname
  }
  # Join each listing to its event so the comparison uses that event's average.
  ?listing <eventid> ?eventid .
  ?eventid <eventname> ?eventname .
  ?listing <listtime> ?listtime .
  ?listing <priceperticket> ?priceperticket .
  ?listing <sellerid> ?seller .
  ?seller <firstname> ?firstname .
  ?seller <lastname> ?lastname .
  BIND(CONCAT(?firstname, " ", ?lastname) AS ?sellername)
  # Keep only the listings priced above the event's average.
  FILTER (?priceperticket > ?avg_price)
}
ORDER BY desc(?priceperticket) ?sellername ?eventname
LIMIT 1000

There is additional documentation on using SPARQL in the AnzoGraph SPARQL reference, and more basic information about SPARQL, the semantic web, and RDF in the Cambridge Semantics Semantic University tutorial.

AnzoGraph benchmarks

As you saw above, SPARQL* can do essentially everything that SQL can, except that it works on RDF* databases rather than relational databases. Cambridge Semantics took advantage of this to translate the TPC-H OLAP benchmark from SQL. While they used this to compare AnzoGraph (an OLAP graph database) to Neo4j (an OLTP graph database), they admitted that the comparison was unfair, and I decline to print that here. I will, however, print the comparison they did between two different AnzoGraph configurations.

This benchmark compares the time to run 22 SPARQL queries translated from TPC-H at two different scales. At scale 25 (2.6 billion triples), only one node was needed. At scale 1000 (100 billion triples), 40 nodes (each AWS r4.8xlarge) were used to run the whole suite in 3.5 minutes. (Chart: Cambridge Semantics)

As always, I suggest that you pay little or no attention to the synthetic benchmarks run by the vendor. Instead, you should attempt to construct a proof of concept of AnzoGraph (and any points of comparison) for your own data sets and applications. You should also twist the vendor’s arms to get pricing information up front so that you can calculate your costs and your cost-to-performance ratios.

AnzoGraph may well turn out to fit into your database estate and enable faster OLAP queries, but it might not. You should probably compare it to TigerGraph, among others.

You’ll also have to factor in the cost of an OLTP database, since AnzoGraph was designed for OLAP and not for OLTP. That could be another graph database such as Neo4j or Amazon Neptune, or a different kind of NoSQL database such as CouchDB or MongoDB, or a SQL database such as PostgreSQL or MariaDB.

Cost: Pricing is based on the number of nodes with a standard recommended configuration. 60-day free trial available.

Platform: Docker/Kubernetes, CentOS, RHEL; AWS, GCP, Azure

At a Glance

AnzoGraph is a fast, horizontally scalable OLAP graph database that provides BI analytics, graph algorithms, inferencing, data science functions, and user-defined functions.

Pros

  • W3C-standard RDF triple and quad data and SPARQL 1.1 queries
  • Property graphs conforming to the proposed RDF* and SPARQL* standards
  • Good SPARQL query performance
  • Scales out in sets of four nodes to handle very large graphs (100 billion triples)

Cons

  • Lacks a client that can display graphs, but working with partners to create one
  • Lacks Neo4j compatibility, but working on openCypher and Bolt support
  • Pricing is not publicly available

Copyright © 2019 IDG Communications, Inc.
