Why you should use a graph database

Graph databases excel for apps that explore many-to-many relationships, such as recommendation systems. Let's look at an example.


Defining graph schema and queries with Gremlin

Let’s look at how to get started with a graph database through a real example: the recommender system we recently added to KillrVideo. KillrVideo is a reference application for sharing and watching videos that we built to help developers learn how to use DataStax Enterprise, including DataStax Enterprise Graph, a graph database built on top of highly scalable data technologies including Apache Cassandra and Apache Spark.

The language used for describing and interacting with graphs in DataStax Enterprise Graph is Gremlin, which is part of the Apache TinkerPop project. Gremlin is widely used for describing graph traversals because of its flexibility, extensibility, and support for both declarative and imperative queries. Gremlin traversals are commonly written in Groovy syntax, as in the examples here, and drivers are available in multiple languages. Most importantly, Gremlin is supported by most popular graph databases, including DataStax Enterprise Graph, Neo4j, Amazon Neptune, and Azure Cosmos DB.

We started by designing a recommendation algorithm so that we could identify the data it would need as input. The approach was to generate recommendations for a given user based on videos that were liked by similar users. Our goal was to generate recommendations in real time as users interact with the KillrVideo application, that is, as an OLTP interaction.

To define the schema, we identified a subset of the data managed by KillrVideo that we needed for our graph. This included users, videos, ratings, and tags, as well as properties of these items that we might reference in the algorithm or present in recommendation results. We then created a graph schema in Gremlin that looked like this:

// create vertex labels
schema.vertexLabel("user").partitionKey('userId').
  properties("userId", "email", "added_date").ifNotExists().create();
schema.vertexLabel("video").partitionKey('videoId').
  properties("videoId", "name", "description", "added_date",
  "preview_image_location").ifNotExists().create();
schema.vertexLabel("tag").partitionKey('name').
  properties("name", "tagged_date").ifNotExists().create();

// create edge labels
schema.edgeLabel("rated").multiple().properties("rating").
  connection("user", "video").ifNotExists().create();
schema.edgeLabel("uploaded").single().properties("added_date").
  connection("user", "video").ifNotExists().create();
schema.edgeLabel("taggedWith").single().
  connection("video", "tag").ifNotExists().create();

We chose to model users, videos, and tags as vertices, and used edges to identify which users uploaded which videos, user ratings of videos, and the tags associated with each video. We assigned properties to vertices and edges that are referenced in queries or included in results. The resulting schema looks like this in DataStax Studio, a notebook-style developer tool for developing and executing queries in CQL and Gremlin.

Based on this schema, we defined queries that populate data into the graph and queries that retrieve data from the graph. Let’s look at a graph query that generates recommendations. Here’s the basic flow: For a given user, identify similar users who liked videos the given user liked, select videos those similar users also liked, exclude videos the given user has already watched, order those videos by popularity, and provide the results.

def numRatingsToSample = 1000
def localUserRatingsToSample = 10
def minPositiveRating = 4
def userID = ...

g.V().has("user", "userId", userID).as("^currentUser")
    // get all of the videos the user watched and store them
    .map(out('rated').dedup().fold()).as("^watchedVideos")
    // go back to the current user
    .select("^currentUser")
    // identify the videos the user rated highly
    .outE('rated').has('rating', gte(minPositiveRating)).inV()
    // what other users rated those videos highly?
    .inE('rated').has('rating', gte(minPositiveRating))
    // limit the number of results so this will work as an OLTP query
    .sample(numRatingsToSample)
    // weight the sample by rating to favor users who rated those videos the highest
    .by('rating').outV()
    // eliminate the current user
    .where(neq("^currentUser"))

Let’s pause for a moment to catch our breath. So far in this traversal we have identified similar users. The second part of the traversal takes those similar users, grabs a limited number of videos those similar users liked, removes videos the user has already watched, and generates a result set sorted by popularity.

    // select a limited number of highly rated videos from each similar user
    .local(outE('rated').has('rating', gte(minPositiveRating))
        .limit(localUserRatingsToSample))
    // carry each rating along with its traverser so it can be summed later
    .sack(assign).by('rating').inV()
    // exclude videos the user has already watched
    .not(where(within("^watchedVideos")))
    // identify the most popular videos by sum of all ratings
    .group().by().by(sack().sum())
    // now that we have a big map of [video: score], order it
    .order(local).by(values, decr).limit(local, 100).select(keys).unfold()
    // output recommended videos including the user who uploaded each video
    .project('video', 'user')
        .by()
        .by(__.in('uploaded'))

While this traversal looks complicated, keep in mind that it is the entire business logic of a recommendation algorithm. We won’t dig into each step of this traversal in detail here, but the language reference is a great resource, and high-quality training courses are available.
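For readers who find the flow easier to follow outside of Gremlin, here is a minimal in-memory sketch of the same steps in plain Java. It is an illustration under simplifying assumptions, not the KillrVideo implementation: the class and method names (RecommendationSketch, recommend, sampleRatings) and the sample data are hypothetical, similar users are deduplicated, and the sampling and per-user limits of the real traversal are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** In-memory sketch of the recommendation flow; hypothetical, not KillrVideo code. */
class RecommendationSketch {

    /** Stand-in for a "rated" edge from a user to a video. */
    static final class Rating {
        final String userId;
        final String videoId;
        final int rating;

        Rating(String userId, String videoId, int rating) {
            this.userId = userId;
            this.videoId = videoId;
            this.rating = rating;
        }
    }

    static List<String> recommend(List<Rating> ratings, String userId,
                                  int minPositiveRating, int maxResults) {
        // Videos the user has already rated (the role of "^watchedVideos").
        Set<String> watched = ratings.stream()
                .filter(r -> r.userId.equals(userId))
                .map(r -> r.videoId)
                .collect(Collectors.toSet());

        // Videos the user rated highly.
        Set<String> liked = ratings.stream()
                .filter(r -> r.userId.equals(userId) && r.rating >= minPositiveRating)
                .map(r -> r.videoId)
                .collect(Collectors.toSet());

        // Similar users: other users who rated one of those videos highly.
        Set<String> similarUsers = ratings.stream()
                .filter(r -> liked.contains(r.videoId)
                        && r.rating >= minPositiveRating
                        && !r.userId.equals(userId))
                .map(r -> r.userId)
                .collect(Collectors.toSet());

        // Score each unwatched video by the sum of the similar users' high ratings.
        Map<String, Integer> scores = ratings.stream()
                .filter(r -> similarUsers.contains(r.userId)
                        && r.rating >= minPositiveRating
                        && !watched.contains(r.videoId))
                .collect(Collectors.groupingBy(r -> r.videoId,
                        Collectors.summingInt(r -> r.rating)));

        // Order by score, highest first, breaking ties by video ID.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(scores.entrySet());
        entries.sort((a, b) -> {
            int byScore = Integer.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        return entries.stream()
                .limit(maxResults)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    /** Made-up ratings used only for this illustration. */
    static List<Rating> sampleRatings() {
        return Arrays.asList(
                new Rating("alice", "v1", 5), new Rating("alice", "v2", 4),
                new Rating("bob", "v1", 5), new Rating("bob", "v3", 5),
                new Rating("bob", "v4", 4), new Rating("carol", "v1", 2),
                new Rating("carol", "v2", 4), new Rating("carol", "v3", 4),
                new Rating("dave", "v5", 5));
    }

    public static void main(String[] args) {
        // bob and carol are similar to alice; v3 scores 5 + 4 = 9 and v4 scores 4.
        System.out.println(recommend(sampleRatings(), "alice", 4, 10)); // prints [v3, v4]
    }
}
```

The sketch scans the full list of ratings several times, which is exactly the kind of work the graph traversal avoids by walking only the edges connected to the starting user.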

I recommend developing traversals interactively over a representative data set using a tool such as DataStax Studio or the Gremlin console from Apache TinkerPop. This allows you to quickly iterate and refine your traversals. DataStax Studio is a web-based environment that provides multiple ways to visualize traversal results as networks of nodes and edges, as shown in the picture below. Other supported views include tables, charts and graphs, as well as performance tracing.

[Figure: DataStax Studio. Image credit: DataStax]

Incorporating a graph database into your architecture

Once you have designed your graph schema and queries, it’s time to integrate the graph into your application. Here’s how we integrated DataStax Enterprise Graph into KillrVideo. KillrVideo’s multi-tier architecture consists of a web application that sits on top of a set of microservices that manage users, videos (including tags), and ratings. These services leverage the DataStax Enterprise Graph database (built on Apache Cassandra) for data storage and access the data using CQL.

We implemented our recommendation engine as part of the Suggested Videos Service, as shown below. This service generates a list of recommendations given a user ID. To implement the recommendation engine, we translated the Gremlin traversal described above into Java code.

[Figure: Suggested Videos Service. Image credit: DataStax]

This architecture highlights a frequent challenge in microservice architectures—the need to interact with data owned by multiple services. As shown above, the graph used to generate recommendations relies on data from the User Management, Video Catalog, and Ratings services.

We preserved the data ownership of our existing services by using asynchronous messaging. The User Management, Video Catalog, and Ratings services publish events on data changes. The Suggested Videos Service subscribes to these events and makes corresponding updates to the graph. The tradeoffs we’ve made here are typical of applications that use a multi-model approach, a topic I explored in my previous article.
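The publish-and-subscribe pattern described here can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the event types (UserCreated, VideoAdded, VideoRated), the EventBus, and the RecommendationGraph class are hypothetical stand-ins, the bus is synchronous and in-process rather than a real asynchronous message broker, and the graph is a pair of in-memory collections rather than DataStax Enterprise Graph.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Consumer;

/** Sketch of keeping a graph in sync through events; all names are hypothetical. */
class GraphEventSketch {

    // Hypothetical events published by the services that own the data.
    static final class UserCreated {
        final String userId;
        UserCreated(String userId) { this.userId = userId; }
    }

    static final class VideoAdded {
        final String videoId;
        final String uploaderId;
        VideoAdded(String videoId, String uploaderId) {
            this.videoId = videoId;
            this.uploaderId = uploaderId;
        }
    }

    static final class VideoRated {
        final String userId;
        final String videoId;
        final int rating;
        VideoRated(String userId, String videoId, int rating) {
            this.userId = userId;
            this.videoId = videoId;
            this.rating = rating;
        }
    }

    /** Minimal synchronous stand-in for a message bus. */
    static final class EventBus {
        private final Map<Class<?>, List<Consumer<Object>>> subscribers = new HashMap<>();

        <T> void subscribe(Class<T> type, Consumer<T> handler) {
            subscribers.computeIfAbsent(type, k -> new ArrayList<>())
                    .add(event -> handler.accept(type.cast(event)));
        }

        void publish(Object event) {
            for (Consumer<Object> subscriber
                    : subscribers.getOrDefault(event.getClass(), Collections.emptyList())) {
                subscriber.accept(event);
            }
        }
    }

    /** Stand-in for the graph owned by the Suggested Videos Service. */
    static final class RecommendationGraph {
        final Set<String> vertices = new TreeSet<>();
        final List<String> edges = new ArrayList<>();

        void subscribeTo(EventBus bus) {
            // Each handler turns a data-change event into a graph update.
            bus.subscribe(UserCreated.class, e -> vertices.add("user:" + e.userId));
            bus.subscribe(VideoAdded.class, e -> {
                vertices.add("video:" + e.videoId);
                edges.add("user:" + e.uploaderId + " -uploaded-> video:" + e.videoId);
            });
            bus.subscribe(VideoRated.class, e ->
                    edges.add("user:" + e.userId + " -rated(" + e.rating + ")-> video:" + e.videoId));
        }
    }

    /** Publishes a few made-up events and returns the resulting graph. */
    static RecommendationGraph demo() {
        EventBus bus = new EventBus();
        RecommendationGraph graph = new RecommendationGraph();
        graph.subscribeTo(bus);
        bus.publish(new UserCreated("u1"));
        bus.publish(new UserCreated("u2"));
        bus.publish(new VideoAdded("v1", "u1"));
        bus.publish(new VideoRated("u2", "v1", 5));
        return graph;
    }

    public static void main(String[] args) {
        RecommendationGraph graph = demo();
        System.out.println(graph.vertices);
        System.out.println(graph.edges);
    }
}
```

The publishing services never touch the graph directly, so each service keeps ownership of its own data while the graph is updated as a side effect of the events.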

Implementing Gremlin traversals in Java

The DataStax Java Driver provides a friendly, fluent API for implementing Gremlin traversals with DataStax Enterprise Graph. The API made it trivial to migrate Groovy-based queries we created in DataStax Studio into Java code.

We were then able to make our Java code even more readable and maintainable by using a Gremlin feature known as domain-specific languages (DSLs). A DSL is an extension of Gremlin into a specific domain. For KillrVideo, we created a DSL to extend the Gremlin traversal implementation with terms that are relevant to the video domain. The KillrVideoTraversalDsl class defines query operations such as user(), which locates the vertex in the graph with a provided UUID, and recommendByUserRating(), which generates recommendations for a provided user based on parameters such as a minimum rating and a requested number of recommendations.

Using a DSL simplified the implementation of the Suggested Videos Service to something like the sample below, which creates a GraphStatement that we then execute using the DataStax Java Driver:

GraphStatement gStatement = DseGraph.statementFromTraversal(killr.users(userIdString)
       .recommendByUserRating(100, 4, 500, 10)
);

Using a DSL allowed us to hide some of the complexity of our graph interactions in reusable functions, which can then be combined as needed to form more complex traversals. This will allow us to implement additional recommendation engines that begin from a selected user vertex provided by the user() method and allow the application to swap between the different implementations.
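To illustrate the idea of domain terms that expand into lower-level steps, here is a deliberately simplified sketch in plain Java. It is not the KillrVideoTraversalDsl class and does not use the TinkerPop or DataStax APIs: the VideoDslSketch class and its ratedAtLeast() and uploader() methods are hypothetical, and the sketch only assembles a Gremlin string, whereas a real DSL would build on the traversal classes themselves.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy fluent DSL that maps video-domain terms to Gremlin steps; hypothetical. */
class VideoDslSketch {

    // The low-level steps accumulated so far, in order.
    private final List<String> steps = new ArrayList<>();

    private VideoDslSketch() { }

    /** Domain term: start from the user vertex with the given ID. */
    static VideoDslSketch user(String userId) {
        VideoDslSketch dsl = new VideoDslSketch();
        dsl.steps.add("V()");
        dsl.steps.add("has('user', 'userId', '" + userId + "')");
        return dsl;
    }

    /** Domain term: move to the videos rated at or above the given rating. */
    VideoDslSketch ratedAtLeast(int minRating) {
        steps.add("outE('rated')");
        steps.add("has('rating', gte(" + minRating + "))");
        steps.add("inV()");
        return this;
    }

    /** Domain term: move from videos to the users who uploaded them. */
    VideoDslSketch uploader() {
        steps.add("in('uploaded')");
        return this;
    }

    /** Renders the accumulated steps as a single Gremlin traversal string. */
    String toGremlin() {
        return "g." + String.join(".", steps);
    }

    public static void main(String[] args) {
        // Three domain terms expand into six low-level steps.
        System.out.println(VideoDslSketch.user("u1").ratedAtLeast(4).uploader().toGremlin());
    }
}
```

Each domain method hides several steps behind one name, and because every method returns the builder, the terms can be chained in whatever order a new traversal needs.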

A working graph example

You can see the results of our integration of DataStax Enterprise Graph into KillrVideo on the “Recommended for you” section of the web application shown below. Try it out for yourself at http://www.killrvideo.com by creating an account and rating a few videos.

[Figure: KillrVideo application. Image credit: DataStax]

I hope that this article gives you some great ideas on how a graph database might make sense for your application, and how to get started with Gremlin and DataStax Enterprise Graph.

Jeff Carpenter is a technical evangelist at DataStax, where he leverages his background in system architecture, microservices, and Apache Cassandra to help developers and operations engineers build distributed systems that are scalable, reliable, and secure. Jeff is the author of Cassandra: The Definitive Guide, 2nd Edition.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.
