DSE Graph review: Graph database does double duty

DSE Graph provides high-performance OLTP and OLAP graph operations, right alongside the DataStax Enterprise column store

Graph databases explicitly express the connections between nodes, and are more efficient than relational databases at analyzing networks (computer, human, geographic, or otherwise). A number of good distributed graph databases have appeared recently, including Amazon Neptune (OLTP, uses the Gremlin and SPARQL query languages), AnzoGraph (OLAP, uses SPARQL*, an enhancement over SPARQL), Neo4j (OLTP with some OLAP capabilities, uses Cypher), and TigerGraph (hybrid OLTP and OLAP, uses GSQL).

DSE Graph is a distributed graph database that is built on a back-end columnar database, DataStax Enterprise (DSE), and uses the Apache TinkerPop Gremlin query language. The product grew out of the open source Titan database, which had multiple back-ends including Cassandra. DataStax Enterprise is an enhanced version of Cassandra.

When DSE Graph was first released, the other Titan back-ends were removed, and the Titan code was completely rewritten to take better advantage of DSE. The data mapping from graphs to columns was rather sparse, however, and loading graphs required using a dedicated graph loader. In the current preview version of DSE Graph 6.8, the graph vertex, edge, and property data mappings to columns are much tighter than previously, and loading graphs can be accomplished with the DSE bulk loader dsbulk, which is the same utility used to load columns.

DSE Graph supports both transactional and analytic workloads, using two different engines. The analytic engine relies on Spark, which is shipped as part of the DSE product.

The current version of DSE Graph is designated Graph Core. The old version is now called Graph Classic; Graph Classic is still in the product for backward compatibility. Graph Core and DSE 6.8 are expected to ship in 2020.

DSE Graph Core architecture and key features

According to DataStax, the Core Engine initiative improves DataStax Graph in three major ways: the graph model is aligned with regular C* (Cassandra) tables; usability is improved; and performance is enhanced. The graph performance improvement comes largely from simplifying the read and write paths.

In the new graph data model, a graph corresponds to a CQL keyspace, a vertex or edge label corresponds to a CQL table, and a property of the underlying vertex or edge label corresponds to a CQL column. Existing keyspaces can be converted to graphs with CQL using the ALTER KEYSPACE syntax, and existing tables can be converted to vertex or edge labels with the ALTER TABLE syntax.
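
To make the mapping concrete, here is a minimal sketch in plain Python (not CQL and not the DSE API) that models a keyspace as a dictionary of tables and then reads the same structure back in graph terms. The table names, column names, and rows are all invented for illustration.

```python
# Toy model of the Core Engine mapping: keyspace == graph, table == vertex or
# edge label, column == property. Every name below is hypothetical.
keyspace = {
    "person": {                      # a table acting as a vertex label
        "kind": "vertex",
        "rows": [
            {"person_id": 1, "name": "alice"},
            {"person_id": 2, "name": "bob"},
        ],
    },
    "rated": {                       # a table acting as an edge label
        "kind": "edge",
        "rows": [
            {"out_id": 1, "in_id": 2, "trust": 5},
        ],
    },
}


def labels(ks, kind):
    """Return the table names that act as vertex or edge labels."""
    return sorted(name for name, table in ks.items() if table["kind"] == kind)


def properties(ks, label):
    """Return the column names (graph properties) stored for a label."""
    columns = set()
    for row in ks[label]["rows"]:
        columns.update(row)
    return sorted(columns)


vertex_labels = labels(keyspace, "vertex")
edge_labels = labels(keyspace, "edge")
person_properties = properties(keyspace, "person")
```

The point of the sketch is only that nothing graph-specific is stored: the same rows serve as table data and as vertices and edges.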

Another benefit of the Core Engine architecture is that you can load graph data just like CQL data, since the graph data really is stored as C* keyspaces, tables, and columns. The dsbulk utility is much faster than the old graphloader utility used for Classic Engine graphs.

dse graph architecture DataStax

DSE Graph architecture. This diagram pre-dates the DSE 6.8 preview, but it’s essentially correct. What has changed is the mapping between DSE Graph and Cassandra, which is not shown explicitly.

DSE Graph installation

I installed DataStax Graph 6.8 and DataStax Studio on an iMac two ways: using Docker/Kubernetes (DataStax Studio), and directly from a tarball. These are Labs previews, not shipping products.

I ran into a few problems, but fixed them with help from DataStax. My installed Java JDK was a little older than the minimum supported version; I fixed that by installing JDK 13 from Oracle. I also had old installations of DSE 6.0 and Spark on my machine, which the new DSE 6.8 installation picked up. With the help of DataStax, I set some paths in cassandra.yaml to “sandbox” the installation, and for good measure I removed the old installations including /var/lib/cassandra and /var/log/cassandra.

While I eventually got both the Docker and tarball installations working, the performance I was able to get out of my old iMac (Core i7 CPU, 16 GB of RAM, and a hard disk) was nothing to write home about.

DataStax created two cloud clusters for me to use for testing, with performance that was more representative of what enterprises use. The larger cluster had three nodes, each an AWS m5d.8xlarge instance with 32 vCPUs, 128 GB of RAM, and SSDs. Needless to say, compared to my iMac this cluster was wicked fast.

Testing DSE Graph

I went through the standard DataStax Studio demo notebook on working with graphs on my iMac, but there wasn’t much data involved. For use on the three-node cluster on AWS, Dr. Denise Gosnell supplied two bigger data sets and notebooks to demonstrate analyzing them, based on material from her upcoming book on graph data (to be published by O’Reilly in 2020). The smaller data set contains 36K trust ratings among 6K Bitcoin providers; the larger data set contains 300K movies and 20M user ratings.

For both data sets, the first step was to create the graph schemas, and the second step was to bulk load the data into the graph databases. After that, I ran various Gremlin queries on the data, using either the OLTP query engine or the Spark OLAP query engine, as appropriate. The screen images below, with their captions, show my testing process.

dse graph 02 IDG

On the left, this notebook cell shows some of the new Gremlin schema calls. On the right, the dropdown shows the two Gremlin engine options, OLTP and OLAP (Spark). The Spark engine can operate at greater scale than the OLTP engine, but supports only a subset of Gremlin.

dse graph 03 rev IDG

This terminal screen shows dsbulk loading the vertex and edge data for the Bitcoin trust database. Each load took less than a second. The rows/s column is more meaningful than the elapsed time.

dse graph 04 IDG

This image shows the community of trust around a specific Bitcoin provider, number 1094. The with("label-warning",false) clause at the beginning suppresses an annoying message that doesn’t matter in this context. From the starting vertex, we follow all incoming and outgoing edges to the neighbor vertices, then all incoming and outgoing edges from those.
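
The two-hop walk described in this caption can be sketched in plain Python. This is not the Gremlin query from the screenshot; it only models the idea of following edges in both directions twice. Provider 1094 comes from the article, while the other vertex ids and all of the edges are made up.

```python
from collections import defaultdict

# Hypothetical trust edges as (rater, rated) pairs.
edges = [(1094, 2), (3, 1094), (2, 4), (5, 3), (6, 7)]

# Index each edge under both endpoints so it can be walked in either direction.
neighbors = defaultdict(set)
for src, dst in edges:
    neighbors[src].add(dst)   # outgoing edge
    neighbors[dst].add(src)   # incoming edge


def two_hop_community(start):
    """Return (first_hop, community) for a walk of two undirected hops."""
    first_hop = set(neighbors[start])
    second_hop = set()
    for vertex in first_hop:
        second_hop |= neighbors[vertex]
    return first_hop, {start} | first_hop | second_hop


first_hop, community = two_hop_community(1094)
```

In this toy graph the edge between 6 and 7 is never reached, because neither vertex is within two hops of the starting provider.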

dse graph 05 IDG

The Gremlin sack() function allows you to accumulate values as you perform a graph traversal. The withSack(0.0) call initializes the sack to a floating point zero. The sack(sum).by("trust") adds the trust value from each edge. As you can see, the total trust determines the sort order, using order().by(sack(), decr).
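
As a rough illustration of what the sack does, the following Python sketch carries a running total with each traverser and sorts by it at the end. It is an analogy for withSack(0.0), sack(sum).by("trust"), and order().by(sack(), decr), not a reimplementation of Gremlin; the edge list and the two-hop outgoing walk are assumptions, since the exact traversal in the screenshot is not reproduced here.

```python
# Hypothetical weighted trust edges: (from_vertex, to_vertex, trust).
edges = [
    ("a", "b", 3.0),
    ("a", "c", 1.0),
    ("b", "d", 2.0),
    ("c", "d", 5.0),
    ("c", "e", -1.0),
]


def walk_with_sack(start, hops):
    """Follow outgoing edges for `hops` steps, summing trust along each path."""
    # Each traverser is (current_vertex, sack); the sack starts at 0.0,
    # which is what withSack(0.0) sets up.
    traversers = [(start, 0.0)]
    for _ in range(hops):
        traversers = [
            (dst, sack + trust)              # like sack(sum).by("trust")
            for vertex, sack in traversers
            for src, dst, trust in edges
            if src == vertex
        ]
    # Like order().by(sack(), decr): largest accumulated trust first.
    return sorted(traversers, key=lambda t: t[1], reverse=True)


ranked = walk_with_sack("a", hops=2)
```

Note that vertex "d" appears twice in the result, once per path that reaches it, because each traverser keeps its own sack.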

dse graph 06 IDG

PageRank is a global algorithm developed at Google to measure the relative importance of web pages. In TinkerPop you need to do your graph traversal with a Graph Computer in order to call pageRank; in DSE Graph, you need to run an analytic query using Spark to call pageRank. Here we ran a basic pageRank() query against the Bitcoin trust graph; it took about 39 seconds. Our previous OLTP queries against the same graph each took less than a second. PageRank touches every vertex in the graph unless you filter for specific vertex labels. Here we are viewing the results as a table.
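
To show why PageRank touches every vertex, here is a minimal power-iteration sketch in Python. It uses the textbook formulation with a damping factor of 0.85; that value, the iteration count, and the toy edges are assumptions, and the result is not claimed to match what the TinkerPop or DSE pageRank() step produces.

```python
def page_rank(edges, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over a list of (src, dst) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices with no outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in vertices if not out_links[v])
        new_rank = {
            v: (1.0 - damping) / n + damping * dangling / n for v in vertices
        }
        # Every vertex hands its rank to its out-neighbors on every pass,
        # which is why the whole graph is visited.
        for src in vertices:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical graph: "c" receives links from every other vertex.
toy_edges = [("a", "c"), ("b", "c"), ("c", "d"), ("d", "c")]
ranks = page_rank(toy_edges)
top_vertex = max(ranks, key=ranks.get)
```

Each pass loops over all vertices and all edges, so the cost grows with the size of the graph rather than with the size of a neighborhood, which fits the gap between the sub-second OLTP queries and the 39-second analytic run.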

dse graph 07 IDG

We can also view the PageRank results as a bubble graph. Here I have used the degree of connectivity to size the vertices and the Louvain Community Detection (clustering) algorithm to color the vertices.

dse graph 08 IDG

Here I’ve moved up to the main DataStax Studio notebooks screen. The semi-duplicated notebooks are there to allow me to reproduce what I saw in a demo, including data loading, without having to drop graphs.

dse graph 09 IDG

We’re now starting to look at a notebook based on a join of the MovieLens and Kaggle Movie databases.

dse graph 10 IDG

These schema() calls define part of the movie graph database we’ll be using for the next half-dozen steps. Note the use of partitionBy() for vertex labels but not edge labels. Partition keys in graph vertices correspond to primary keys in C* tables, and imply index creation.

dse graph 11 rev IDG

Here we are using dsbulk in a shell running in the cluster to load data into the schema we just created in the notebook. The largest set is the ratings, comprising 20 million edges, which loaded in 1 minute 23 seconds.
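
The load time in this caption converts to a throughput figure with simple arithmetic. The sketch below assumes exactly 20 million rows, which the article gives only as a round number.

```python
def rows_per_second(rows, minutes=0, seconds=0.0):
    """Convert a row count and an elapsed time into rows per second."""
    elapsed = minutes * 60 + seconds
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return rows / elapsed


# 20 million ratings edges loaded in 1 minute 23 seconds (83 seconds).
ratings_rate = rows_per_second(20_000_000, minutes=1, seconds=23)
```

Under that assumption the ratings load works out to roughly 241,000 rows per second.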

dse graph 12 IDG

The movie data set, merged from IMDB and TMDB, contains over 300K movies.

dse graph 13 IDG

There are over 20 million ratings edges, which came from MovieLens. Note that this was an analytic query run as a Spark job, and only took 1.4 seconds.

dse graph 14 IDG

The query shown here uses aggregate("x") calls to create a temporary collection of movies with ratings similar to Toy Story’s, and cap("x") to emit that collection. The unfold() step unbundles the collection so that we can view the items.
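
The roles of aggregate("x"), cap("x"), and unfold() can be mimicked in plain Python. The sketch below is an analogy rather than the query in the screenshot: the ratings data is invented (only the Toy Story title comes from the article), and "similar" is arbitrarily taken to mean movies rated 4 or higher by users who also rated Toy Story 4 or higher.

```python
# Hypothetical ratings: user -> {movie title: rating}.
ratings = {
    "u1": {"Toy Story": 5, "Movie A": 5, "Movie B": 2},
    "u2": {"Toy Story": 4, "Movie A": 4, "Movie C": 5},
    "u3": {"Toy Story": 1, "Movie D": 5},
}


def liked_alongside(target, min_rating=4):
    """Collect movies rated highly by users who also rated `target` highly."""
    x = []                                     # the side-effect collection "x"
    for user_ratings in ratings.values():
        if user_ratings.get(target, 0) >= min_rating:
            for movie, rating in user_ratings.items():
                if movie != target and rating >= min_rating:
                    x.append(movie)            # like aggregate("x")
    capped = [x]                               # like cap("x"): one object holding the list
    # Like unfold(): emit each element of the collection individually.
    return [item for collection in capped for item in collection]


similar = liked_alongside("Toy Story")
```

"Movie A" appears twice because two different users contributed it; the collection records every traverser that reached the aggregate step.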

dse graph 15 IDG

Here we are looking at movies rated by a single user. The valueMap() call displays all the values of each movie vertex returned by the traversal.

dse graph 16 IDG

Here we are looking at the same user, and returning the movie name and her rating for her 20 highest-rated movies.

Graph database on the side

DSE Graph offers high-performance OLTP and OLAP graph operations in a way that is integrated with DataStax Enterprise proper, eliminating the need to have both a columnar database and a graph database. As we’ve seen, DSE Graph has very good support for Gremlin and very good scalability. When DSE 6.8 is released in 2020, I expect the product to be competitive with Amazon Neptune, Neo4j, and TigerGraph.

It’s impossible to discuss value in the absence of pricing information, and DataStax will not release pricing to me. Nevertheless, it’s worth putting DataStax on your evaluation list for graph databases. A week or two of a free trial using a decent-sized cluster, with a proof of concept for your own applications, should tell you what you need to know.

— 

Cost: DataStax Enterprise, which includes DSE Graph, is free for non-production use, but requires a subscription to be used in production. Subscriptions are priced either by node or by core.

Platform: Windows, MacOS, Linux; Docker, Kubernetes. 

At a Glance
  • DSE Graph offers high-performance OLTP and OLAP graph operations in a way that is integrated with DataStax Enterprise proper, the company’s enhanced version of Cassandra.

    Pros

    • Graph database for both OLTP and OLAP applications
    • Very good performance and scalability
    • Standard open-source Gremlin query language
    • Well integrated with DataStax Enterprise

    Cons

    • No SPARQL support
    • Tarball installation can be tricky if you have ever used Cassandra or Spark on the same computer

Copyright © 2019 IDG Communications, Inc.