There has been a lot of hype recently about graph databases. While graph databases such as DataStax Enterprise Graph (based on Titan DB), Neo4j, and IBM Graph have been around for several years, recent announcements of managed cloud services like AWS Neptune and Microsoft’s addition of graph capability to Azure Cosmos DB indicate that graph databases have entered the mainstream. With all of this interest, how do you determine whether a graph database is right for your application?
What is a graph database?
Before we go any further, let’s define some terminology. What is a graph database? Think of it in terms of the data model. A graph data model consists of vertices that represent the entities in a domain, and edges that represent the relationships between these entities. Because both vertices and edges can have additional name-value pairs called properties, this data model is formally known as a property graph. Some graph databases require you to define a schema for your graph—i.e. defining labels or names for your vertices, edges, and properties prior to populating any data—while other databases allow you to operate without a fixed schema.
As you might have noticed, there isn’t any new information in the graph data model that we couldn’t capture in a traditional relational data model. After all, it’s simple to describe relationships between tables using foreign keys, or we can describe properties of a relationship with a join table. The key difference between these data models is the way data is organized and accessed. The recognition of edges as a “first class citizen” alongside vertices in the graph data model enables the underlying database engine to iterate very quickly in any direction through networks of vertices and edges to satisfy application queries, a process known as traversal.
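To make the idea of edges as first-class citizens concrete, here is a minimal in-memory sketch in plain Java. It is an illustration only: the class and method names (PropertyGraphSketch, connect, out, in) and the sample data are assumptions made for this example, and no graph database stores its data this way internally. The point is that each vertex holds direct references to its edges, so a traversal is a walk over those references rather than a join.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory property graph; real graph databases use their own storage engines.
public class PropertyGraphSketch {

    // A vertex carries a label, properties, and direct references to its edges.
    static final class Vertex {
        final String label;
        final Map<String, Object> properties = new HashMap<>();
        final List<Edge> outEdges = new ArrayList<>();
        final List<Edge> inEdges = new ArrayList<>();

        Vertex(String label) {
            this.label = label;
        }

        Vertex property(String key, Object value) {
            properties.put(key, value);
            return this;
        }
    }

    // An edge is a first-class object with its own label and properties.
    static final class Edge {
        final String label;
        final Vertex from;
        final Vertex to;
        final Map<String, Object> properties = new HashMap<>();

        Edge(String label, Vertex from, Vertex to) {
            this.label = label;
            this.from = from;
            this.to = to;
        }
    }

    // Creates an edge and registers it on both endpoints so it can be walked either way.
    static Edge connect(Vertex from, String label, Vertex to) {
        Edge edge = new Edge(label, from, to);
        from.outEdges.add(edge);
        to.inEdges.add(edge);
        return edge;
    }

    // Follows outgoing edges with the given label to their target vertices.
    static List<Vertex> out(Vertex start, String edgeLabel) {
        List<Vertex> result = new ArrayList<>();
        for (Edge edge : start.outEdges) {
            if (edge.label.equals(edgeLabel)) {
                result.add(edge.to);
            }
        }
        return result;
    }

    // Follows incoming edges with the given label back to their source vertices.
    static List<Vertex> in(Vertex start, String edgeLabel) {
        List<Vertex> result = new ArrayList<>();
        for (Edge edge : start.inEdges) {
            if (edge.label.equals(edgeLabel)) {
                result.add(edge.from);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample data invented for this sketch.
        Vertex alice = new Vertex("user").property("userId", "alice");
        Vertex bob = new Vertex("user").property("userId", "bob");
        Vertex video = new Vertex("video").property("name", "Intro to graphs");
        connect(alice, "rated", video).properties.put("rating", 5);
        connect(bob, "rated", video).properties.put("rating", 4);

        // Two-hop traversal: videos alice rated, then every user who rated them.
        for (Vertex v : out(alice, "rated")) {
            for (Vertex u : in(v, "rated")) {
                System.out.println(u.properties.get("userId") + " rated " + v.properties.get("name"));
            }
        }
    }
}
```

In a relational model the same two-hop question would typically go through a join table twice. Here each hop is a lookup on the vertex already in hand.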
The flexibility of the graph data model is a key factor driving the recent surge in graph database popularity. The same requirements for availability and massive scale that drove the development and adoption of various NoSQL offerings over the past 10 or so years are continuing to bear fruit in the recent graph trend.
How to know when you need a graph database
However, as with any popular technology, there can be a tendency to apply graph databases to every problem. It’s important to make sure that you have a use case that is a good fit. For example, graphs are often applied to problem domains like:
- Social networks
- Recommendation and personalization
- Customer 360, including entity resolution (correlating user data from multiple sources)
- Fraud detection
- Asset management
Whether your use case fits within one of those domains or not, there are some other factors that you should consider that can help determine if a graph database is right for you:
- Many-to-many relationships. In his book “Designing Data-Intensive Applications” (O’Reilly), Martin Kleppmann suggests that frequent many-to-many relationships in your problem domain are a good indicator for graph usage, since relational databases tend to struggle to navigate these relationships efficiently.
- High value of relationships. Another heuristic I’ve frequently heard: if the relationships between your data elements are just as important or more important than the elements themselves, you should consider using graph.
- Low latency at large scale. Adding another database into your application also adds complexity to your application. The ability of graph databases to navigate through the relationships represented in large data sets more quickly than other types of databases is what justifies this additional complexity. This is especially true in cases where a complex relational join query no longer performs acceptably and there are no additional optimization gains to be made to the query or relational structure.
Defining graph schema and queries with Gremlin
Let’s take a look at how to get started using a graph database using a real example, the recommender system we recently added to KillrVideo. KillrVideo is a reference application for sharing and watching videos that we built to help developers learn how to use DataStax Enterprise, including DataStax Enterprise Graph, a graph database built on top of highly scalable data technologies including Apache Cassandra and Apache Spark.
The language used for describing and interacting with graphs in DataStax Enterprise Graph is Gremlin, which is part of the Apache TinkerPop project. Gremlin is known as the go-to language for describing graph traversals due to its flexibility, extensibility, and support for both declarative and imperative queries. Gremlin is based on the Groovy language, and drivers are available in multiple languages. Most importantly, Gremlin is supported by most popular graph databases including DataStax Enterprise Graph, Neo4j, AWS Neptune, and Azure Cosmos DB.
We designed a recommendation algorithm to identify the data we would need as input. The approach was to generate recommendations for a given user based on videos that were liked by similar users. Our goal was to generate recommendations in real-time as users interact with the KillrVideo application, i.e. as an OLTP interaction.
To define the schema, we identified a subset of the data managed by KillrVideo that we needed for our graph. This included users, videos, ratings, and tags, as well as properties of these items that we might reference in the algorithm or present in recommendation results. We then created a graph schema in Gremlin that looked like this:
// create vertex labels
properties("userId", "email", "added_date").ifNotExists().create();
properties("videoId", "name", "description", "added_date",
// create edge labels
We chose to model users, videos, and tags as vertices, and used edges to identify which users uploaded which videos, user ratings of videos, and the tags associated with each video. We assigned properties to vertices and edges that are referenced in queries or included in results. The resulting schema looks like this in DataStax Studio, a notebook-style developer tool for developing and executing queries in CQL and Gremlin.
Based on this schema, we defined queries that populate data into the graph and queries that retrieve data from the graph. Let’s look at a graph query that generates recommendations. Here’s the basic flow: For a given user, identify similar users who liked videos the given user liked, select videos those similar users also liked, exclude videos the given user has already watched, order those videos by popularity, and provide the results.
def numRatingsToSample = 1000
def localUserRatingsToSample = 10
def minPositiveRating = 4
def userID = ...
g.V().has("user", "userId", userID).as("^currentUser")
// get all of the videos the user watched and store them
// go back to the current user
// identify the videos the user rated highly
// what other users rated those videos highly?
// limit the number of results so this will work as an OLTP query
// sort by rating to favor users who rated those videos the highest
// eliminate the current user
Let’s pause for a moment to catch our breath. So far in this traversal we have identified similar users. The second part of the traversal takes those similar users, grabs a limited number of videos those similar users liked, removes videos the user has already watched, and generates a result set sorted by popularity.
// select a limited number of highly rated videos from each similar user
// exclude videos the user has already watched
// identify the most popular videos by sum of all ratings
// now that we have a big map of [video: score], order it
.order(local).by(values, decr).limit(local, 100).select(keys).unfold()
// output recommended videos including the user who uploaded each video
While this traversal looks complicated, keep in mind that it is the entire business logic of a recommendation algorithm. We won’t dig into each step of this traversal in detail here, but the language reference is a great resource, and high quality training courses are available.
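For readers who find the flow easier to follow outside of Gremlin, the same steps can be sketched in plain Java over an in-memory list of ratings. This is not the KillrVideo implementation: the RecommendationSketch class, its Rating type, and the sample data are assumptions made for illustration. It treats "watched" as "rated" and omits the sampling limits (numRatingsToSample and localUserRatingsToSample) that keep the real traversal fast enough for OLTP use.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Plain-Java sketch of the recommendation flow; hypothetical names, no graph database involved.
public class RecommendationSketch {

    // One "rated" edge: a user, a video, and the rating property on the edge.
    static final class Rating {
        final String userId;
        final String videoId;
        final int rating;

        Rating(String userId, String videoId, int rating) {
            this.userId = userId;
            this.videoId = videoId;
            this.rating = rating;
        }
    }

    static List<String> recommend(List<Rating> ratings, String userId, int minPositiveRating, int limit) {
        // Videos the current user has rated at all, and the subset rated highly.
        Set<String> watched = new HashSet<>();
        Set<String> liked = new HashSet<>();
        for (Rating r : ratings) {
            if (r.userId.equals(userId)) {
                watched.add(r.videoId);
                if (r.rating >= minPositiveRating) {
                    liked.add(r.videoId);
                }
            }
        }

        // Similar users: other users who also rated one of those liked videos highly.
        Set<String> similarUsers = new HashSet<>();
        for (Rating r : ratings) {
            if (!r.userId.equals(userId) && r.rating >= minPositiveRating && liked.contains(r.videoId)) {
                similarUsers.add(r.userId);
            }
        }

        // Score unwatched videos by the sum of high ratings from similar users.
        Map<String, Integer> scores = new HashMap<>();
        for (Rating r : ratings) {
            if (similarUsers.contains(r.userId) && r.rating >= minPositiveRating
                    && !watched.contains(r.videoId)) {
                scores.merge(r.videoId, r.rating, Integer::sum);
            }
        }

        // Highest score first; ties are broken by video ID so the order is deterministic.
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample data invented for this sketch.
        List<Rating> ratings = new ArrayList<>();
        ratings.add(new Rating("u1", "v1", 5));
        ratings.add(new Rating("u1", "v2", 2));
        ratings.add(new Rating("u2", "v1", 5));
        ratings.add(new Rating("u2", "v2", 5));
        ratings.add(new Rating("u2", "v3", 5));
        ratings.add(new Rating("u3", "v1", 4));
        ratings.add(new Rating("u3", "v3", 4));
        ratings.add(new Rating("u3", "v4", 5));
        ratings.add(new Rating("u4", "v5", 5));
        System.out.println(recommend(ratings, "u1", 4, 10));
    }
}
```

Note that this version scans the full ratings list three times. The graph traversal instead starts at one user vertex and only touches the edges reachable from it, which is where the latency advantage discussed earlier comes from.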
I recommend developing traversals interactively over a representative data set using a tool such as DataStax Studio or the Gremlin console from Apache TinkerPop. This allows you to quickly iterate and refine your traversals. DataStax Studio is a web-based environment that provides multiple ways to visualize traversal results as networks of nodes and edges, as shown in the picture below. Other supported views include tables, charts and graphs, as well as performance tracing.
Incorporating a graph database into your architecture
Once you have designed your graph schema and queries, it’s time to integrate the graph into your application. Here’s how we integrated DataStax Enterprise Graph into KillrVideo. KillrVideo’s multi-tier architecture consists of a web application that sits on top of a set of microservices that manage users, videos (including tags), and ratings. These services leverage the DataStax Enterprise database (built on Apache Cassandra) for data storage and access the data using CQL.
We implemented our recommendation engine as part of the Suggested Videos Service, as shown below. This service generates a list of recommendations given a user ID. To implement the recommendation engine, we translated the Gremlin traversal described above into Java code.
This architecture highlights a frequent challenge in microservice architectures—the need to interact with data owned by multiple services. As shown above, the graph used to generate recommendations relies on data from the User Management, Video Catalog, and Ratings services.
We preserved the data ownership of our existing services by using asynchronous messaging. The User Management, Video Catalog, and Ratings services publish events on data changes. The Suggested Videos Service subscribes to these events and makes corresponding updates to the graph. The tradeoffs we’ve made here are typical of applications that use a multi-model approach, a topic I explored in my previous article.
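The publish and subscribe pattern described above can be sketched in a few lines of Java. Everything here is a stand-in invented for illustration: the VideoRated event, the EventBus, and the RecommendationGraph class are hypothetical, the bus delivers events synchronously in-process rather than through a real message broker, and the "graph" is just a nested map of rated edges.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Sketch of event-driven graph updates; all names are hypothetical.
public class GraphEventSketch {

    // Event a ratings service might publish when a user rates a video.
    static final class VideoRated {
        final String userId;
        final String videoId;
        final int rating;

        VideoRated(String userId, String videoId, int rating) {
            this.userId = userId;
            this.videoId = videoId;
            this.rating = rating;
        }
    }

    // Stand-in for a message broker: delivers each event to every subscriber.
    static final class EventBus {
        private final List<Consumer<VideoRated>> subscribers = new ArrayList<>();

        void subscribe(Consumer<VideoRated> subscriber) {
            subscribers.add(subscriber);
        }

        void publish(VideoRated event) {
            for (Consumer<VideoRated> subscriber : subscribers) {
                subscriber.accept(event);
            }
        }
    }

    // Stand-in for the recommendation graph: userId -> (videoId -> rating) "rated" edges.
    static final class RecommendationGraph {
        final Map<String, Map<String, Integer>> ratedEdges = new HashMap<>();

        // Upserts the edge, so a redelivered or updated event leaves one edge per user and video.
        void onVideoRated(VideoRated event) {
            ratedEdges.computeIfAbsent(event.userId, k -> new HashMap<>())
                    .put(event.videoId, event.rating);
        }
    }

    public static void main(String[] args) {
        EventBus bus = new EventBus();
        RecommendationGraph graph = new RecommendationGraph();

        // The service that owns the graph subscribes; the ratings service only publishes.
        bus.subscribe(graph::onVideoRated);
        bus.publish(new VideoRated("user-1", "video-1", 5));

        System.out.println(graph.ratedEdges);
    }
}
```

Writing the handler as an upsert is a deliberate choice in this sketch: with asynchronous messaging an event may be delivered more than once, and an idempotent update keeps the graph consistent in that case. The publishing service never touches the graph directly, which is how data ownership stays with the original services.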
Implementing Gremlin traversals in Java
The DataStax Java Driver provides a friendly, fluent API for implementing Gremlin traversals with DataStax Enterprise Graph. The API made it trivial to migrate Groovy-based queries we created in DataStax Studio into Java code.
We were then able to make our Java code even more readable and maintainable by using a Gremlin feature known as DSLs, domain specific languages. A DSL is an extension of Gremlin into a specific domain. For KillrVideo, we created a DSL to extend the Gremlin traversal implementation with terms that are relevant to the video domain. The KillrVideoTraversalDsl class defines query operations such as user(), which locates the vertex in the graph with a provided UUID, and recommendByUserRating(), which generates recommendations for a provided user based on parameters such as a minimum rating and a requested number of recommendations.
Using a DSL simplified the implementation of the Suggested Videos Service to something like the sample below, which creates a GraphStatement that we then execute using the DataStax Java Driver:
GraphStatement gStatement = DseGraph.statementFromTraversal(killr.users(userIdString)
    .recommendByUserRating(100, 4, 500, 10));
Using a DSL allowed us to hide some of the complexity of our graph interactions in reusable functions, which can then be combined as needed to form more complex traversals. This will allow us to implement additional recommendation engines that begin from a selected user vertex provided by the user() method and allow the application to swap between the different implementations.
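The core idea of a DSL, domain-named methods that expand into several lower-level steps, can be shown without any graph library. The sketch below is not the TinkerPop DSL mechanism or the KillrVideo code: the VideoTraversal class and its ratedAtLeast() method are hypothetical, and the traversal is only recorded as text that resembles Gremlin steps rather than executed against a graph.

```java
import java.util.ArrayList;
import java.util.List;

// Toy fluent DSL that records steps as text; hypothetical names, nothing is executed.
public class DslSketch {

    static final class VideoTraversal {
        private final List<String> steps = new ArrayList<>();

        // Low-level building block: append one raw step.
        VideoTraversal step(String rawStep) {
            steps.add(rawStep);
            return this;
        }

        // Domain step: start from the user vertex with the given ID.
        VideoTraversal user(String userId) {
            return step("V()").step("has('user','userId','" + userId + "')");
        }

        // Domain step: move to the videos this user rated at or above a minimum rating.
        VideoTraversal ratedAtLeast(int minRating) {
            return step("outE('rated')")
                    .step("has('rating',gte(" + minRating + "))")
                    .step("inV()");
        }

        // Renders the recorded steps as a single dotted expression.
        String describe() {
            return String.join(".", steps);
        }
    }

    public static void main(String[] args) {
        // Callers speak in domain terms; the low-level steps stay inside the DSL.
        String traversal = new VideoTraversal().user("u1").ratedAtLeast(4).describe();
        System.out.println(traversal);
    }
}
```

Because every domain method returns the traversal itself, domain steps and raw steps can be chained freely, which is what lets alternative recommendation strategies share the same starting point.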
A working graph example
You can see the results of our integration of DataStax Enterprise Graph into KillrVideo on the “Recommended for you” section of the web application shown below. Try it out for yourself at http://www.killrvideo.com by creating an account and rating a few videos.
I hope that this article gives you some great ideas on how a graph database might make sense for your application, and how to get started with Gremlin and DataStax Enterprise Graph.
Jeff Carpenter is a technical evangelist at DataStax, where he leverages his background in system architecture, microservices, and Apache Cassandra to help empower developers and operations engineers build distributed systems that are scalable, reliable, and secure. Jeff is the author of Cassandra: The Definitive Guide, 2nd Edition.
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to email@example.com.