TigerGraph review: A graph database designed for deep analytics

Highly parallelized and horizontally scalable, TigerGraph shines for use cases that require multi-hop analytic queries

At a Glance

Graph databases offer a more efficient way to model relationships and networks than relational (SQL) databases or other kinds of NoSQL databases (document, wide column, and so on). Lately, many products have arisen in this space, which was originally (in 1999) the sole province of Neo4j.


TigerGraph, a recent arrival, is a “real-time native parallel graph database.” It is available for deployment in the cloud or on-premises; it scales both up and out; it automatically partitions a graph within a cluster; it’s ACID compliant; it has built-in data compression; and it claims to be faster than the competition. As we’ll see, it uses a message-passing architecture that is inherently parallel in a way that scales with the size of the data.

TigerGraph was designed to be able to perform deep link analytics as well as real-time online transaction processing (OLTP) and high-volume data loading. By “deep link analytics,” TigerGraph means following relationships from a vertex through the graph for three or more hops and analyzing the results. Most other graph databases were designed primarily for OLTP and for the navigation and analysis of small numbers of hops; any serious analytic capabilities were added later.

While there are several open-source graph query languages that have been widely adopted, such as Cypher, Gremlin, and SPARQL, TigerGraph has a new query language, GSQL. GSQL combines SQL-like query syntax with Cypher-like graph navigation, plus procedural programming and user-defined functions.

I have mixed feelings about TigerGraph’s new GSQL query language. Yes, it’s a nice design; yes, it’s parallelizable; and yes, TigerGraph can convert Cypher to GSQL for people moving from a Neo4j database. Nevertheless, every time I am faced with yet another programming language, I have to ask myself whether it would be worth my time and effort to learn it thoroughly.

My feelings about the rest of the product are less mixed. TigerGraph shows a great deal of promise for a new graph database.

TigerGraph architecture

As we can see in the block diagram below, TigerGraph has an ETL loader (left), graph storage and processing engines with query language and visual clients as well as a REST API (middle), and integration with lots of enterprise data infrastructure services. The system flow diagram further below makes it clear that TigerGraph uses Apache Kafka message queuing to talk to the graph processing and storage engines, with an Nginx web server handling GraphStudio and GSQL requests from multiple users and passing them along to the matching back-end servers.

[Image: TigerGraph platform overview. Credit: TigerGraph]

The TigerGraph Analytics Platform combines a graph storage engine, a graph processing engine, and three types of API. It can run on-premises, in the cloud, or in a hybrid configuration.

Message passing in TigerGraph allows for parallel processing at the per-vertex and per-edge level. RESTPP, an enhanced REST API server, is central to task management.
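TigerGraph’s engines are proprietary, so the internals are opaque, but the general idea of per-vertex message passing can be sketched in the Pregel style. The toy graph, vertex names, and hop-distance computation below are my own illustration, not TigerGraph code:

```python
# Pregel-style sketch: in each superstep, every vertex with messages in
# its inbox processes them (conceptually in parallel) and sends new
# messages along its edges. The "computation" here is hop distance (BFS).

from collections import defaultdict

def hop_distances(edges, source):
    graph = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
        graph[v].append(u)          # treat edges as undirected
    dist = {source: 0}
    inbox = {source: [0]}           # messages delivered to each vertex
    while inbox:
        outbox = defaultdict(list)
        for vertex, msgs in inbox.items():   # one superstep
            d = min(msgs)
            for nbr in graph[vertex]:
                if nbr not in dist:          # first message wins in BFS
                    dist[nbr] = d + 1
                    outbox[nbr].append(d + 1)
        inbox = outbox
    return dist

edges = [("Tom", "Dan"), ("Dan", "Jenny"), ("Jenny", "Amily")]
print(hop_distances(edges, "Tom"))
# {'Tom': 0, 'Dan': 1, 'Jenny': 2, 'Amily': 3}
```

In TigerGraph the supersteps run across partitions in parallel; this sketch runs them sequentially, but the inbox/outbox flow is the same shape.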

According to TigerGraph, the system can load up to 150 GB of data per hour, traverse hundreds of millions of vertices or edges per second per machine, stream 2 billion daily events in real time to a graph with 100 billion vertices and 600 billion edges on a cluster of 20 commodity machines, and unify real-time analytics with large-scale offline data processing.
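Those vendor figures are easier to judge when converted to rates. A quick back-of-envelope conversion (my arithmetic, not TigerGraph’s):

```python
# Back-of-envelope check on the vendor's throughput claims.
events_per_day = 2_000_000_000
seconds_per_day = 24 * 60 * 60
events_per_second = events_per_day / seconds_per_day
print(f"{events_per_second:,.0f} events/sec across the cluster")      # ~23,148

machines = 20
print(f"{events_per_second / machines:,.0f} events/sec per machine")  # ~1,157

load_gb_per_hour = 150
mb_per_second = load_gb_per_hour * 1024 / 3600
print(f"{mb_per_second:.1f} MB/sec sustained load rate")              # ~42.7
```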

[Image: TigerGraph platform diagram. Credit: TigerGraph]

TigerGraph uses several popular open-source components: Nginx for web traffic, Apache Kafka for message queuing, and Apache Zookeeper for Kafka cluster management. The rest of the platform currently uses proprietary code.

TigerGraph installation on Docker

You can install TigerGraph on various popular versions of Linux, plus Docker and VirtualBox. I chose to install on an iMac using Docker. Before starting the installation, I updated Docker and increased the RAM and processors available to it to 4 GB and four cores. I then downloaded the current TigerGraph image for Docker to my local machine, using links from the email I received after registering for a developer license on the site.

Martins-iMac:Downloads mheller$ docker load < ./tigergraph-developer-2.2.3-docker-image.tar.gz
8823818c4748: Loading layer    119MB/119MB
19d043c86cbc: Loading layer  15.87kB/15.87kB
883eafdbe580: Loading layer  14.85kB/14.85kB
4775b2f378bb: Loading layer  5.632kB/5.632kB
75b79e19929c: Loading layer  3.072kB/3.072kB
2106b49716cb: Loading layer  7.168kB/7.168kB
da572f4e0c2f: Loading layer  4.034GB/4.034GB
6cd767fef659: Loading layer  338.4kB/338.4kB
Loaded image: tigergraph:2.2.3
Martins-iMac:Downloads mheller$ docker image ls
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
tigergraph          2.2.3               e1911655f9a7        3 weeks ago         4.11GB
hello-world         latest              4ab4c602aa5e        2 months ago        1.84kB

That installed without any alarms or excursions, as you can see above. The startup, shown below, also proceeded by the book.

Martins-iMac:Downloads mheller$ docker run -i -t --name tigergraph -p 4142:14240 tigergraph:2.2.3
Welcome to TigerGraph Developer Edition, for non-commercial use only.
[RUN    ] rm -rf /home/tigergraph/tigergraph/logs/ALL*.pid
[FAB    ][2018-11-30 21:56:06] check_port_of_admin_servers
[RUN    ] /home/tigergraph/tigergraph/.gium/GSQL_LIB/service/../scripts/admin_service.sh start
/home/tigergraph/tigergraph/bin/admin_server/config.sh
=== zk ===
[SUMMARY][ZK] process is down
[SUMMARY][ZK] /home/tigergraph/tigergraph/zk is ready
=== dict ===
[SUMMARY][DICT] process is down
[SUMMARY][DICT] dict server has NOT been initialized
=== kafka ===
[SUMMARY][KAFKA] process is down
[SUMMARY][KAFKA] queue has NOT been initialized
=== gse ===
[SUMMARY][GSE] process is down
[SUMMARY][GSE] id service has NOT been initialized
=== gpe ===
[SUMMARY][GPE] process is down
[SUMMARY][GPE] graph has NOT been initialized
=== nginx ===
[SUMMARY][NGINX] process is down
[SUMMARY][NGINX] nginx has NOT been initialized
=== restpp ===
[SUMMARY][RESTPP] process is down
[SUMMARY][RESTPP] restpp has NOT been initialized
[FAB    ][2018-11-30 21:56:47] launch_zookeepers
[FAB    ][2018-11-30 21:57:00] launch_gsql_subsystems:DICT
[FAB    ][2018-11-30 21:57:04] launch_kafkas
[FAB    ][2018-11-30 21:57:22] launch_ts3s
[FAB    ][2018-11-30 21:57:25] launch_gsql_subsystems:GSE
[FAB    ][2018-11-30 21:57:28] launch_gsql_subsystems:GPE
[FAB    ][2018-11-30 21:57:31] launch_gsql_subsystems:NGINX
[FAB    ][2018-11-30 21:57:34] launch_gsql_subsystems:RESTPP
[FAB    ][2018-11-30 21:57:38] check_port_of_vis_services
[RUN    ] LD_LIBRARY_PATH="/home/tigergraph/tigergraph/bin"
/home/tigergraph/tigergraph/visualization/utils/start.sh
[FAB    ][2018-11-30 21:57:39] check_port_of_admin_servers
[RUN    ]
/home/tigergraph/tigergraph/.gium/GSQL_LIB/service/../scripts/admin_service.sh start
/home/tigergraph/tigergraph/bin/admin_server/config.sh
[RUN    ] /home/tigergraph/tigergraph/dev/gdk/gsql/gsql_server_util START || :
=== zk ===
[SUMMARY][ZK] process is up
[SUMMARY][ZK] /home/tigergraph/tigergraph/zk is ready
=== kafka ===
[SUMMARY][KAFKA] process is up
[SUMMARY][KAFKA] queue is ready
=== gse ===
[SUMMARY][GSE] process is up
[SUMMARY][GSE] id service has NOT been initialized (not_ready)
=== dict ===
[SUMMARY][DICT] process is up
[SUMMARY][DICT] dict server is ready
=== ts3 ===
[SUMMARY][TS3] process is up
[SUMMARY][TS3] ts3 is ready
=== graph ===
[SUMMARY][GRAPH] graph has NOT been initialized
=== nginx ===
[SUMMARY][NGINX] process is up
[SUMMARY][NGINX] nginx is ready
=== restpp ===
[SUMMARY][RESTPP] process is up
[SUMMARY][RESTPP] restpp is ready
=== gpe ===
[SUMMARY][GPE] process is up
[SUMMARY][GPE] graph has NOT been initialized (not_ready)
=== gsql ===
[SUMMARY][GSQL] process is up
[SUMMARY][GSQL] gsql is ready
=== Visualization ===
[SUMMARY][VIS] process is up (VIS server PID: 1242)
[SUMMARY][VIS] gui server is up
[RUN    ] rm -rf ~/.gsql/gstore_gs*_autostart_flag
Done.
tigergraph@2089c417aa54:~$

At this point I started the gsql client and worked through some command-line tutorials and demos.

tigergraph@2089c417aa54:~$ ls /home/tigergraph/
friendship.csv   hello2.gsql  hello.gsql  person.csv  tigergraph  tigergraph_coredump
tigergraph@2089c417aa54:~$ gsql
Welcome to TigerGraph Developer Edition, for non-commercial use only.
GSQL-Dev >

TigerGraph’s GSQL 101 and GSQL examples

The GSQL 101 tutorial teaches you how to create graph schemas, load data, and run queries. Everything worked as advertised. I was a little surprised at the time it took to complete certain operations, such as installing custom queries, but the time I saw was consistent with the documentation.

Just to give you a feeling for GSQL, the following custom query demonstrates using accumulators, nesting queries, and installing a custom query.

USE GRAPH social
CREATE QUERY hello2 (VERTEX<person> p) FOR GRAPH social{
  OrAccum  @visited = false;
  AvgAccum @@avgAge;
  Start = {p};
  FirstNeighbors = SELECT tgt
                   FROM Start:s -(friendship:e)-> person:tgt
                   ACCUM tgt.@visited += true, s.@visited += true;
  SecondNeighbors = SELECT tgt
                    FROM FirstNeighbors -(:e)-> :tgt
                    WHERE tgt.@visited == false
                    POST-ACCUM @@avgAge += tgt.age;
  PRINT SecondNeighbors;
  PRINT @@avgAge;
}
INSTALL QUERY hello2
RUN QUERY hello2("Tom")

The GSQL demo examples cover additional territory, including collaborative filtering, PageRank, product recommendations, and shortest path algorithms. These are also worth studying and running, although you’ll want to take them one step at a time so that you can understand what they’re doing, rather than running all the supplied files in a batch. Since the developer license only allows one graph per database, you’ll need to run DROP ALL to remove the social graph created in the GSQL 101 exercise.

The GSQL Graph Algorithm Library implements standard graph algorithms and tests for them as GSQL queries. You can download the library from GitHub. As you’ll see, the base algorithms combine with installation scripts to generate customized algorithms for your graph. The algorithms currently include closeness centrality, connected component detection, community detection, PageRank, shortest paths, and triangle counting.
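The library’s implementations are GSQL queries, but as a reminder of what its PageRank algorithm actually computes, here is a minimal power-iteration sketch in Python (my own illustration, not the library’s code):

```python
# Minimal PageRank by power iteration: rank flows along out-edges,
# damped by a factor d, until the ranks settle.

def pagerank(edges, d=0.85, iters=50):
    nodes = {n for e in edges for n in e}
    out = {n: [v for u, v in edges if u == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for u in nodes:
            if out[u]:
                share = d * rank[u] / len(out[u])
                for v in out[u]:
                    new[v] += share
            else:                      # dangling node: spread rank evenly
                for v in nodes:
                    new[v] += d * rank[u] / len(nodes)
        rank = new
    return rank

edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]
ranks = pagerank(edges)
print({n: round(ranks[n], 3) for n in sorted(ranks)})
```

The GSQL version in the library expresses the same per-vertex update with accumulators, which is what lets TigerGraph run it in parallel across partitions.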

GraphStudio and the TigerGraph test drives

In addition to the command-line interface for GSQL, TigerGraph has a GUI called GraphStudio. You can run it on your local instance by browsing to localhost:4142, as shown below.

[Image: TigerGraph GraphStudio home screen. Credit: IDG]

TigerGraph’s GraphStudio GUI makes it easy to design schemas, load data, and explore graphs. Writing queries is done in GSQL.

For learning purposes, however, you might want to explore the TigerGraph test-drive demos. These are read-only graph databases that include some predefined parametrized queries; you can also write your own queries.

The three test-drive use cases that have large data sets (billions of edges) use Amazon EC2 r4.4xlarge (16 vCPU, 122 GiB RAM) instances. The two test-drive use cases that have small databases use more economical Amazon EC2 t2.xlarge (4 vCPU, 16 GiB RAM) instances. In my experience, the performance was pretty good, even for the anti-fraud demo graph (shown below), which has 4.4 billion edges.

[Image: TigerGraph multi-transaction fraud query in GraphStudio. Credit: IDG]

Using GraphStudio to execute a fraud-detection query against a large (4.4 billion edges) graph of users, devices, and transactions. GraphStudio and TigerGraph are both running on an Amazon EC2 r4.4xlarge instance.

TigerGraph benchmarks

TigerGraph has been touting some benchmarks that compare its performance to several other graph databases (Neo4j, Amazon Neptune, JanusGraph, and ArangoDB) for tests of data loading time and graph analytic query time, all of which are tasks where TigerGraph can parallelize the operations. Not surprisingly, given the way the benchmarks were set up with a bias towards TigerGraph’s strengths, TigerGraph wins on all tests, sometimes by huge factors. The only benchmark in the paper that impresses me is the TigerGraph-only cluster scalability test, which shows a 6.7x speedup when running with eight machines.
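For context, a 6.7x speedup on eight machines is about 84 percent parallel efficiency. Plugging that into Amdahl’s law (my arithmetic, applied to TigerGraph’s published number) implies a serial fraction of roughly 3 percent:

```python
# What a 6.7x speedup on 8 machines implies, per Amdahl's law.
machines = 8
speedup = 6.7

efficiency = speedup / machines
print(f"parallel efficiency: {efficiency:.1%}")           # ~83.8%

# Amdahl: speedup = 1 / (s + (1 - s)/n); solve for the serial fraction s.
serial_fraction = (1 / speedup - 1 / machines) / (1 - 1 / machines)
print(f"implied serial fraction: {serial_fraction:.1%}")  # ~2.8%

# Predicted speedup on a larger cluster with the same serial fraction:
n = 16
predicted = 1 / (serial_fraction + (1 - serial_fraction) / n)
print(f"predicted speedup on {n} machines: {predicted:.1f}x")  # ~11.3x
```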

The question to ask yourself is “What are my primary use cases for graph databases?” If the answer involves online transaction processing (OLTP), then you may not even care about bulk loading and multi-hop analytic query performance. Ultimately, the only measurement that actually matters is how the database behaves for your application, which is why taking the time to perform a proof of concept is so important when adopting a new technology.

[Image: TigerGraph cluster scalability graph. Credit: IDG]

TigerGraph cluster scalability test, graphed. There is a 6.7x speedup when using eight machines. If the scalability were perfect, this would be a straight line and there would be an 8x speedup.

TigerGraph in the cloud

As I was doing the research for this review, TigerGraph announced a managed cloud offering to begin trials on AWS in 2019. This is a welcome addition to TigerGraph’s current bring-your-own-license, single-instance images for the AWS and Azure clouds, and the current requirement to perform manual setup for clusters.

At a Glance
  • TigerGraph shows a lot of promise for a new graph database. It distinguishes itself by offering a high degree of parallelism and by performing well for use cases that require multi-hop analytic queries.

    Pros

    • Highly parallel graph processing
    • Expressive GSQL graph query language
    • Able to perform OLTP and OLAP queries simultaneously
    • Scales well in clusters of up to eight machines

    Cons

    • GSQL graph query language is new, and not the same as popular graph query languages
    • The level of hype coming from the company can be hard to swallow