Relational databases have dominated data management for decades, but they've recently lost ground to NoSQL alternatives. While NoSQL data stores aren't right for every use case, they are generally better for big data, which is shorthand for systems that process massive volumes of data. Four types of data store are used for big data:
- Key/value stores such as Memcached and Redis
- Document-oriented databases such as MongoDB, CouchDB, and DynamoDB
- Column-oriented data stores such as Cassandra and HBase
- Graph databases such as Neo4j and OrientDB
This tutorial introduces Neo4j, which is a graph database used for interacting with highly related data. While relational databases are good at managing relationships between data, graph databases are better at managing n-th degree relationships. As an example, take a social network, where you want to analyze patterns involving friends, friends of friends, and so on. A graph database would make it easy to answer a question like, "Given five degrees of separation, what are five movies popular with my social network that I have not yet seen?" Such questions are common for recommendation software, and graph databases are perfect for solving them. Additionally, graph databases are good at representing hierarchical data, such as access controls, product catalogs, movie databases, or even network topologies and organization charts. When you have objects with multiple relationships, you'll quickly find that graph databases offer an elegant, object-oriented paradigm for managing those objects.
The case for graph databases
As the name suggests, graph databases are good at representing graphs of data. This is especially useful for social software, where every time you connect with someone, a relationship is defined between you. In your last job search, you probably picked a few companies that you were interested in and then searched your social networks for connections to them. While you might not know anyone working for one of those companies, someone in your social network likely does. Solving a problem like this is easy at one or two degrees of separation (your friend or a friend of a friend), but what happens when you start extending the search across your network?
In their book, Neo4j in Action, Aleksa Vukotic and Nicki Watt explore the differences between relational databases and graph databases for solving social network problems. I'm going to draw on their work for the next few examples, in order to show you why graph databases are becoming an increasingly popular alternative to relational databases.
Modeling complex relationships: Neo4j vs MySQL
From a computer science perspective, when we think about modeling relationships between users in a social network, we might draw a graph like the one in Figure 1.
Figure 1. Graphing relationships in a social network
A user has IS_FRIEND_OF relationships with other users, and those users have IS_FRIEND_OF relationships with other users, and so forth. Figure 2 shows how we'd represent this in a relational database.
Figure 2. Modeling a social graph in a relational database
The USER table has a one-to-many relationship with the USER_FRIEND table, which models the "friend" relationship between two users. Now that we've modeled the relationships, how would we query our data? Vukotic and Watt measured the query performance for counting the number of distinct friends going out to a depth of five levels (friends of friends of friends of friends of friends). In a relational database the queries would look as follows:
# Depth 1
select count(distinct uf.user_2) from user_friend uf where uf.user_1 = ?
# Depth 2
select count(distinct uf2.user_2) from user_friend uf1
inner join user_friend uf2 on uf1.user_2 = uf2.user_1
where uf1.user_1 = ?
# Depth 3
select count(distinct uf3.user_2) from user_friend uf1
inner join user_friend uf2 on uf1.user_2 = uf2.user_1
inner join user_friend uf3 on uf2.user_2 = uf3.user_1
where uf1.user_1 = ?
# And so on...
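The "and so on" pattern can be sketched programmatically. The following is a minimal illustration using Python's built-in sqlite3 module rather than MySQL. It assumes that a row (user_1, user_2) means user_1 is a friend of user_2 and that each extra level joins the previous level's user_2 to the next level's user_1. The function names and the toy data are hypothetical.

```python
import sqlite3


def friends_at_depth_sql(depth: int) -> str:
    """Build the self-join query counting distinct users exactly `depth` hops away."""
    # One additional self-join of user_friend per level beyond the first.
    joins = "".join(
        f" inner join user_friend uf{i} on uf{i - 1}.user_2 = uf{i}.user_1"
        for i in range(2, depth + 1)
    )
    return (
        f"select count(distinct uf{depth}.user_2) from user_friend uf1"
        f"{joins} where uf1.user_1 = ?"
    )


# Toy data (hypothetical): 1 -> 2 -> 3 -> 4, plus a shortcut 1 -> 3.
conn = sqlite3.connect(":memory:")
conn.execute("create table user_friend (user_1 integer, user_2 integer)")
conn.executemany(
    "insert into user_friend values (?, ?)",
    [(1, 2), (2, 3), (3, 4), (1, 3)],
)


def count_friends(user_id: int, depth: int) -> int:
    """Run the generated query for one user and return the count."""
    return conn.execute(friends_at_depth_sql(depth), (user_id,)).fetchone()[0]


for depth in range(1, 5):
    print(depth, count_friends(1, depth))
```

Each call to friends_at_depth_sql adds one more self-join per level, which is the growth in query size that the following discussion is about.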
What is interesting about these queries is that each time we go out one more level, we are required to join the USER_FRIEND table with itself. Table 1 shows what researchers Vukotic and Watt found when they inserted 1,000 users with approximately 50 relationships each (50,000 relationships) and ran the queries.
Table 1. MySQL query response time for various depths of relationships
Depth | Execution time (seconds) | Count result |
2 | 0.028 | ~900 |
3 | 0.213 | ~999 |
4 | 10.273 | ~999 |
5 | 92.613 | ~999 |
MySQL does a great job of joining data up to three levels away, but performance degrades rapidly after that. The reason is that each time the USER_FRIEND table is joined with itself, MySQL must compute the Cartesian product of the table, even though the majority of the data will be thrown away. For example, when performing that join five times, the Cartesian product results in 50,000^5 rows, or about 3.125 × 10^23 rows. That's a waste when we are only interested in 1,000 of them!
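The size of that Cartesian product is easy to check. This short sketch works the arithmetic using the figures from the experiment (50,000 relationships, five levels); the variable names are only illustrative.

```python
relationships = 50_000  # rows in USER_FRIEND in the experiment described above
levels = 5              # depth-five query: the table appears five times

# Size of the unfiltered Cartesian product of the table with itself.
cartesian_rows = relationships ** levels
print(f"{cartesian_rows:.3e}")  # 3.125e+23

# Candidate rows per user we could possibly care about (1,000 users).
rows_per_user = cartesian_rows // 1_000
print(f"{rows_per_user:.3e}")
```

Written out in full, 50,000^5 = (5 × 10^4)^5 = 3,125 × 10^20, which is 3.125 × 10^23.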
Next, Vukotic and Watt tried executing the same type of queries against Neo4j. These entirely different results are shown in Table 2.
Table 2. Neo4j response time for various depths of relationships
Depth | Execution time (seconds) | Count result |
2 | 0.04 | ~900 |
3 | 0.06 | ~999 |
4 | 0.07 | ~999 |
5 | 0.07 | ~999 |
The takeaway from these execution comparisons is not that Neo4j is better than MySQL. Rather, when traversing these types of relationships, Neo4j's performance is dependent on the number of records retrieved, whereas MySQL's performance is dependent on the number of records in the USER_FRIEND table. Thus, as the number of relationships increases, the response times for MySQL queries will likewise increase, whereas the response times for Neo4j queries will remain the same. This is because Neo4j's response time is dependent on the number of relationships for a specific query, and not on the total number of relationships.
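To see why traversal cost tracks the records visited rather than the size of the whole data set, consider a breadth-first walk over an adjacency list. This is only a sketch of the general idea, not Neo4j's actual storage engine or traversal code, and the graph and names are made up for illustration.

```python
from collections import deque


def friends_within(graph: dict[str, set[str]], start: str, max_depth: int) -> set[str]:
    """Return everyone reachable from `start` in at most `max_depth` hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        user, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        # Only the friends of users actually reached are ever examined.
        for friend in graph.get(user, ()):
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, depth + 1))
    seen.discard(start)
    return seen


graph = {
    "ann": {"bob", "cat"},
    "bob": {"dan"},
    "cat": {"dan"},
    "dan": {"eve"},
    "zed": {"yan"},  # unrelated part of the graph, never touched from "ann"
}

print(sorted(friends_within(graph, "ann", 2)))  # ['bob', 'cat', 'dan']
```

Adding millions of unrelated users like "zed" would not change the work done for "ann", because the walk never leaves her neighborhood.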
Scaling Neo4j for big data
Taking this experiment one step further, Vukotic and Watt next created a million users with 50 million relationships between them. Table 3 shows results for that data set.
Table 3. Neo4j response time for 50 million relationships
Depth | Execution time (seconds) | Count result |
2 | 0.01 | ~2,500 |
3 | 0.168 | ~110,000 |
4 | 1.359 | ~600,000 |
5 | 2.132 | ~800,000 |
Needless to say, I am indebted to Aleksa Vukotic and Nicki Watt and highly recommend checking out their work. I extracted all the tests in this section from the first chapter of their book, Neo4j in Action.
Getting started with Neo4j
You've seen that Neo4j is capable of querying massive amounts of highly related data very quickly, and there's no doubt it's a better fit than MySQL (or any relational database) for certain kinds of problems. If you want to understand more about how Neo4j works, the easiest way is to interact with it through the web console.
Start by downloading Neo4j. For this article, you'll want the Community Edition, which as of this writing is at version 3.2.3.
- On a Mac, download a DMG file and install it as you would any other application.
- On Windows, either download an EXE and walk through an installation wizard or download a ZIP file and decompress it on your hard drive.
- On Linux, download a TAR file and decompress it on your hard drive.
- Alternatively, use a Docker image on any operating system.
Once you have installed Neo4j, start it up and open a browser window to the following URL:
http://127.0.0.1:7474/browser/
Log in with the default username of neo4j and the default password of neo4j. You should see a screen similar to Figure 3.
Figure 3. Web interface for Neo4j
Nodes and relationships in Neo4j
Neo4j is designed around the concept of nodes and relationships:
- A node represents a thing, such as a user, a movie, or a book.
- A node contains a set of key/value pairs, such as a name, a title, or a publisher.
- A node's label defines what type of thing it is--again, a User, a Movie, or a Book.
- Relationships define associations between nodes and are of specific types.
As an example, we might define Character nodes such as Iron Man and Captain America; define a Movie node named "Avengers"; and then define an APPEARS_IN relationship between Iron Man and Avengers and between Captain America and Avengers. All of this is shown in Figure 4.
Figure 4. Nodes and relationships
Figure 4 shows three nodes (two Character nodes and one Movie node) and two relationships (both of type APPEARS_IN).
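As a rough mental model, the same structure can be sketched with plain data classes. This is a simplification for illustration only, not how Neo4j stores data: the class and field names are hypothetical, the "name" property key is an assumption, and each node carries a single label here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                       # what type of thing this is
    properties: dict = field(default_factory=dict)   # key/value pairs


@dataclass
class Relationship:
    start: Node
    rel_type: str                                    # relationships have a specific type
    end: Node


iron_man = Node("Character", {"name": "Iron Man"})
captain_america = Node("Character", {"name": "Captain America"})
avengers = Node("Movie", {"name": "Avengers"})

relationships = [
    Relationship(iron_man, "APPEARS_IN", avengers),
    Relationship(captain_america, "APPEARS_IN", avengers),
]

# Who appears in Avengers? Follow APPEARS_IN relationships that end at the movie.
cast = [
    r.start.properties["name"]
    for r in relationships
    if r.rel_type == "APPEARS_IN" and r.end is avengers
]
print(cast)  # ['Iron Man', 'Captain America']
```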
Modeling and querying nodes and relationships
Similar to how a relational database uses Structured Query Language (SQL) to interact with data, Neo4j uses Cypher Query Language to interact with nodes and relationships.
Let's use Cypher to create a simple representation of a family. At the top of the web interface, look for the dollar sign. This indicates a field that allows you to execute Cypher queries directly against Neo4j. Enter the following Cypher query into that field (I'm using my family as an example, but feel free to change the details to model your own family if you like):
CREATE (person:Person {name: "Steven", age: 45}) RETURN person
The result is shown in Figure 5.
Figure 5. Creating a Person with Cypher Query Language
In Figure 5 you can see a new node with the label Person and the name Steven. If you hover your mouse over the node in your web console, you will see its properties at the bottom. In this case, the properties are ID: 19, name: Steven, and age: 45. Now let's break down the Cypher query:
- CREATE: The CREATE keyword is used to create nodes and relationships. In this case, we pass it a single argument, which is a Person enclosed in parentheses, so it is meant to create a single node.
- (person:Person {...}): The lowercase "person" is a variable name through which we can access the person being created, while the capitalized "Person" is the label. Note that a colon separates the variable name from the label.
- {name: "Steven", age: 45}: These are the key/value properties that we're defining for the node we're creating. Neo4j does not require you to define a schema before creating nodes, and each node can have a unique set of properties. (Most of the time you define nodes with the same label to have the same properties, but it is not required.)
- RETURN person: After the node is created, we ask Neo4j to return it to us. This is why we saw the node appear in the user interface.
The CREATE command (which is case-insensitive) is used to create nodes and can be read as follows: create a new node with the Person label that contains name and age properties; assign it to the person variable and return it to the caller.
Querying with Cypher Query Language
Next we want to try some querying with Cypher. First, we'll need to create a few more people, so that we can define relationships between them.
CREATE (person:Person {name: "Michael", age: 16}) RETURN person
CREATE (person:Person {name: "Rebecca", age: 7}) RETURN person
CREATE (person:Person {name: "Linda"}) RETURN person
Once you've created your four people, you can either click on the Person button under the Node Labels (visible if you click on the database icon in the upper left corner of the web page) or execute the following Cypher query:
MATCH (person: Person) RETURN person
Cypher uses the MATCH keyword to find things in Neo4j. In this example, we are asking Cypher to match all nodes that have a label of Person, assign those nodes to the person variable, and return the value that is associated with that variable. As a result you should see the four nodes that you've created. If you hover over each node in your web console, you will see each person's properties. (You might note that I excluded my wife's age from her node, illustrating that properties do not need to be consistent across nodes, even of the same label. I am also not foolish enough to publish my wife's age.)
We can extend this MATCH example a little further by adding conditions to the nodes we want returned. For example, if we wanted just the "Steven" node, we could retrieve it by matching on the name property:
MATCH (person: Person {name: "Steven"}) RETURN person
Or, if we wanted to return all of the children we could request all people having an age under 18:
MATCH (person: Person) WHERE person.age < 18 RETURN person
In this example we added the WHERE clause to the query to narrow our results. WHERE works very similarly to its SQL equivalent: MATCH (person: Person) finds all nodes with the Person label, and then the WHERE clause filters values out of the result set.
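The match-then-filter sequence can be mimicked with plain Python lists to make the semantics concrete. This is only an analogy, not how Neo4j evaluates the query. It also shows what happens to the node without an age: in Cypher, comparing a missing property yields null, and WHERE keeps only rows where the condition is true, so that node is dropped.

```python
people = [
    {"label": "Person", "name": "Steven", "age": 45},
    {"label": "Person", "name": "Michael", "age": 16},
    {"label": "Person", "name": "Rebecca", "age": 7},
    {"label": "Person", "name": "Linda"},  # no age property
]

# MATCH (person: Person): keep every node carrying the Person label.
matched = [p for p in people if p["label"] == "Person"]

# WHERE person.age < 18: a missing age never satisfies the comparison.
children = [
    p["name"]
    for p in matched
    if p.get("age") is not None and p["age"] < 18
]
print(children)  # ['Michael', 'Rebecca']
```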
Modeling direction in relationships
We have four nodes, so let's create some relationships. First of all, let's create the IS_MARRIED_TO relationship between Steven and Linda:
MATCH (steven:Person {name: "Steven"}), (linda:Person {name: "Linda"}) CREATE (steven)-[:IS_MARRIED_TO]->(linda) RETURN steven, linda
In this example we match two Person nodes, one named Steven and one named Linda, and we create a relationship of type IS_MARRIED_TO from Steven to Linda. The format for creating the relationship is as follows:
(node1)-[relationshipVariable:RELATIONSHIP_TYPE]->(node2)
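Because the arrow in this pattern gives every relationship a start node and an end node, it helps to see what direction means when reading the data back. The sketch below models relationships as (start, type, end) tuples in plain Python. The helper names are hypothetical, and it only illustrates the idea that a lookup can either follow the arrow or ignore it, much as a Cypher MATCH pattern can be written with or without the arrowhead.

```python
# One directed relationship: (start, type, end), mirroring (steven)-[:IS_MARRIED_TO]->(linda).
edges = [("Steven", "IS_MARRIED_TO", "Linda")]


def outgoing(name: str, rel_type: str) -> list[str]:
    """Follow relationships of `rel_type` that start at `name`."""
    return [end for start, t, end in edges if t == rel_type and start == name]


def either_direction(name: str, rel_type: str) -> list[str]:
    """Find the other end of each `rel_type` relationship touching `name`."""
    return [
        end if start == name else start
        for start, t, end in edges
        if t == rel_type and name in (start, end)
    ]


print(outgoing("Steven", "IS_MARRIED_TO"))          # ['Linda']
print(outgoing("Linda", "IS_MARRIED_TO"))           # []
print(either_direction("Linda", "IS_MARRIED_TO"))   # ['Steven']
```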