Review: Connect your data better with Neo4j

Designed for linking relationships, the Neo4j graph database combines speed, ease, and extreme flexibility, though the query language may take some getting used to

At a Glance
  • Neo Technology Neo4j 2.1.3

A graph database seems a natural fit for the sort of data being produced by this world of social networks. Bob knows Frank, and Frank knows Suzie. You can represent each by a node in the graph database, and you can represent their “knowing” by relationships (arcs) connecting those nodes. You can even capture how well Bob knows Frank or how long Frank has known Suzie. Extend the database to their friends’ friends, to friends twice- and thrice-removed, and so on, and pretty soon you have a database for something like Facebook or LinkedIn or countless other social networking sites.

Other kinds of nodes can go into the database, too: clothing stores, pubs, restaurants, coffee shops, you name it. Suzie likes Tony's Pizza, so there’s another relationship in the database. And because Frank knows Suzie, an application might scan the database and recommend Tony's to Frank based on his knowing Suzie as well as other similar relationships the two share. You get the idea.

Neo4j from Neo Technology is a graph database system that you might use to build the system described above. Of course, Neo4j can be used to capture more than personal acquaintances and shopping or eating preferences. How about the connections among subway stations in a city? Or genealogical trees? Any data set whose elements are in some way connected or related is a good candidate for Neo4j. In addition to social networking and recommendation systems, common applications for Neo4j include asset management, network management, master data management, and identity-based access control.

The community edition of Neo4j is free, though it lacks many of the scaling, clustering, and backup features of the paid-for Enterprise edition. Also, the only support you’ll get with the community edition will come from, well, the Neo4j community. In addition, a free personal license is available, which adds clustering and backup capabilities, but no support. (See www.neo4j.com/subscriptions for details of the different versions.)

Written in Java, Neo4j is available in both embedded and server forms. The embedded version is suitable for small, targeted applications, while the server is for large-scale, clustered environments managing many simultaneous clients.

Nodes and relationships

Before continuing with Neo4j’s features, let's introduce some terminology. Discussing a graph database using terms from the relational world is difficult, as the parallels are so tenuous.

A graph consists of nodes connected to one another by relationships. A node (sometimes called a vertex) is typically used to model an entity, which can be a real-world object like a person, a city, or a subway station, or an abstract concept such as an idea, a theorem, or a physical law. Relationships are connections between nodes; if you’re familiar with graphs in the mathematical world, a relationship corresponds to an arc (sometimes called an edge). In Neo4j, every relationship is directed, running from a start node to an end node. Thus, a "likes" relationship directed from Frank to Suzie tells us only that Frank likes Suzie; it doesn't tell us whether Suzie likes Frank.

In Neo4j, a node can be adorned with one or more labels (specified as strings), which are generally used to specify a node’s type -- such as "person" or "restaurant." Labels can be quickly added or removed, so you can create temporary sets of nodes to reflect changing circumstances. You could, for example, quickly add the “OnVacation” label to a set of person nodes to exclude those nodes from certain queries while the people they represent are sunning themselves on a tropical beach.

Every relationship in Neo4j has a type, which is more or less equivalent to a node’s label, though any given relationship can have only one type. Examples of relationship types might be "likes" and "has eaten at."

Properties can be associated with both nodes and relationships. A property is a name/value pair, where the name is a string and the value is either a primitive (boolean, integer, char, string, and so on) or an array of primitives.

You can attach a property to a node to capture attributes of the node’s object. You might attach to a person node properties such as "name: Bob," "age: 24," "address: 1234 Oak St.," and so on. By attaching properties to a relationship, you enrich the relationship’s information content. To a "knows" relationship, you might add the property "Days: 30" to record how long Frank has known Suzie.
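To make the vocabulary concrete, here is a minimal sketch of the model in plain Python. It is not Neo4j's API; the Node and Relationship classes and the variable names are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity carrying a set of labels and name/value properties."""
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed connection from `start` to `end` with exactly one type."""
    start: Node
    end: Node
    type: str
    properties: dict = field(default_factory=dict)


bob = Node({"person"}, {"name": "Bob", "age": 24})
frank = Node({"person"}, {"name": "Frank"})
suzie = Node({"person"}, {"name": "Suzie"})

relationships = [
    Relationship(bob, frank, "knows"),
    Relationship(frank, suzie, "knows", {"Days": 30}),
]

# Labels are cheap to add and remove, so they can mark temporary sets.
suzie.labels.add("OnVacation")
reachable = [n.properties["name"] for n in (bob, frank, suzie)
             if "person" in n.labels and "OnVacation" not in n.labels]
print(reachable)  # prints ['Bob', 'Frank']
suzie.labels.discard("OnVacation")
```

Note that direction lives in the relationship itself: the second entry says Frank knows Suzie and nothing about whether Suzie knows Frank.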

To speed queries on nodes and relationships, Neo4j lets you define indexes on properties. Indexes are "eventually available" -- when you create an index the call returns immediately, but the index is constructed in the background, so there may be a delay before the index is available for use in queries.

Finally, a path is a collection of one or more nodes with connecting relationships. Typically, queries return paths. For example, the response to a query that asks for the shortest route between two subway stations would be a path between the initial and terminal nodes.
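The subway example can be sketched with a breadth-first search in plain Python. This only illustrates what "a path as a query result" means; it says nothing about how Neo4j computes shortest paths internally, and the station names and the shortest_path function are invented for the example.

```python
from collections import deque


def shortest_path(connections, start, goal):
    """Return the list of stations on a fewest-hops route, or None."""
    if start == goal:
        return [start]
    previous = {start: None}          # station -> station we arrived from
    queue = deque([start])
    while queue:
        station = queue.popleft()
        for neighbor in connections.get(station, []):
            if neighbor in previous:
                continue              # already reached by a route at least as short
            previous[neighbor] = station
            if neighbor == goal:
                path = []
                while neighbor is not None:
                    path.append(neighbor)
                    neighbor = previous[neighbor]
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical directed connections between stations.
subway = {
    "Central": ["Market", "Museum"],
    "Market": ["Harbor"],
    "Museum": ["Airport"],
    "Harbor": ["Airport"],
}

print(shortest_path(subway, "Central", "Airport"))
# prints ['Central', 'Museum', 'Airport']
```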

Schema optional

The engineers behind Neo4j describe it as "schema optional." If you want to define a new database and start filling it with nodes and relationships helter-skelter, you’re free to do that. However, if you need tighter control over your database’s structure, you can define constraints that enforce a sort of schema. Currently, only uniqueness constraints are supported, and only on node properties. For example, you could define a uniqueness constraint that requires that the value of property "name" be unique for all nodes with label "person."
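The effect of such a constraint can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Neo4j's implementation: ConstrainedStore, UniquenessViolation, and their methods are hypothetical names, and property values are assumed to be hashable.

```python
class UniquenessViolation(Exception):
    """Raised when a node would duplicate a constrained property value."""


class ConstrainedStore:
    def __init__(self):
        self.nodes = []
        self._unique = {}  # (label, property name) -> values already taken

    def add_uniqueness_constraint(self, label, prop):
        taken = set()
        for labels, properties in self.nodes:
            if label in labels and prop in properties:
                if properties[prop] in taken:
                    raise UniquenessViolation(
                        f"existing data violates {label}.{prop}")
                taken.add(properties[prop])
        self._unique[(label, prop)] = taken

    def create_node(self, labels, properties):
        claims = []
        for (label, prop), taken in self._unique.items():
            if label in labels and prop in properties:
                if properties[prop] in taken:
                    raise UniquenessViolation(
                        f"{label}.{prop} = {properties[prop]!r} already exists")
                claims.append((taken, properties[prop]))
        for taken, value in claims:   # record values only once all checks pass
            taken.add(value)
        self.nodes.append((set(labels), dict(properties)))


store = ConstrainedStore()
store.add_uniqueness_constraint("person", "name")
store.create_node({"person"}, {"name": "Bob"})
store.create_node({"restaurant"}, {"name": "Bob"})  # other label: allowed
try:
    store.create_node({"person"}, {"name": "Bob"})
    rejected = False
except UniquenessViolation:
    rejected = True
print(rejected)  # prints True
```

The constraint is scoped to a label and a property together, which is why a restaurant named Bob does not collide with a person named Bob.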

The Neo4j database engine’s transaction-controlled access is fully ACID compliant. The system supports two-phase commit, which allows applications to incorporate external libraries (such as Apache Lucene) that can participate in Neo4j transactions.

By default, the Neo4j engine implements a read-committed isolation level for database transactions. Read-committed is not the most stringent isolation level, as it does not apply read locks on database elements read within a transaction. However, Neo4j provides "manual" locking of nodes and relationships, which means that an application can employ locks to bring the stricter repeatable read or serializable isolation levels to transactions.

As good as the transaction system and manual locking are (the lock manager provides deadlock detection), you must nevertheless think carefully about all update operations. For example, if you are creating a new relationship between two nodes, write locks are taken on both nodes involved. And Neo4j databases have their own kinds of integrity constraints. You can’t simply delete a node, as it might have one or more relationships connected to it (deleting the node would create a dangling reference). You must delete the relationships first, or else Neo4j will throw an exception when you try to delete the node.
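The delete rule is easy to picture with a toy graph in plain Python. The Graph class and the NodeStillHasRelationships exception are hypothetical; the real exception type and API belong to Neo4j and are not reproduced here.

```python
class NodeStillHasRelationships(Exception):
    """Raised when deleting a node would leave a dangling relationship."""


class Graph:
    def __init__(self):
        self.nodes = set()
        self.relationships = set()  # (start, type, end) triples

    def add_relationship(self, start, rel_type, end):
        self.nodes.update((start, end))
        self.relationships.add((start, rel_type, end))

    def delete_relationships_of(self, node):
        self.relationships = {(s, t, e) for s, t, e in self.relationships
                              if node not in (s, e)}

    def delete_node(self, node):
        if any(node in (s, e) for s, _, e in self.relationships):
            raise NodeStillHasRelationships(node)
        self.nodes.discard(node)


graph = Graph()
graph.add_relationship("Frank", "knows", "Suzie")

try:
    graph.delete_node("Suzie")          # refused: a relationship still points here
    first_attempt_failed = False
except NodeStillHasRelationships:
    first_attempt_failed = True

graph.delete_relationships_of("Suzie")  # remove the relationships first...
graph.delete_node("Suzie")              # ...then the node can go
print(first_attempt_failed, sorted(graph.nodes))  # prints True ['Frank']
```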

While an application that uses the embedded library will access a Neo4j database through the Java API, client applications of a server installation will use Neo4j’s REST API. The REST API employs a discovery mechanism that allows clients to determine the endpoint handlers of various classes of operation. Send an HTTP GET request to a well-known URI (called the "service root") on the server, and the response will be a JSON document that maps other server-based URIs to the actions they support. For example, one entry in that JSON response might be the following:

"cypher" : "http://<hostname>:7474/db/data/cypher"

This indicates the URI to which a client application can send Cypher commands -- Cypher being Neo4j’s query language counterpart to SQL.
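A client might perform that discovery step roughly as follows. This is a sketch in Python using only the standard library; the function names are hypothetical, the host "localhost" and the service root path /db/data/ are assumptions about a default installation, and the sample document is abbreviated to two entries for illustration.

```python
import json
import urllib.request


def fetch_service_root(url="http://localhost:7474/db/data/"):
    """GET the service root and return its JSON body as text.

    The default URL is an assumption; substitute your server's address.
    """
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def endpoint_for(service_root_json, action):
    """Look up the URI that handles `action` in a service root document."""
    return json.loads(service_root_json)[action]


# Abbreviated, illustrative response; a real server returns more entries.
sample_service_root = """
{
  "cypher": "http://localhost:7474/db/data/cypher",
  "node": "http://localhost:7474/db/data/node"
}
"""

print(endpoint_for(sample_service_root, "cypher"))
# prints http://localhost:7474/db/data/cypher
```

Looking endpoints up at run time, rather than hard-coding them, is the point of the discovery mechanism: the client depends only on the service root.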

Like SQL, Cypher is a declarative language for performing CRUD (create, retrieve, update, and delete) operations on a database. Cypher queries -- like SQL’s -- are composed of clauses, and some Cypher clauses are very similar to their SQL counterparts. Cypher clauses can be chained, with the intermediate results of one clause feeding the input of the next.

Querying with Cypher

MATCH is Cypher’s equivalent to SQL’s SELECT. But because MATCH operates on a graph database, it uses pattern matching (hence its name). In a pattern you pass to MATCH, you can specify labels and properties, as well as the geometry of the path to match. Take the following Cypher command:

MATCH (me {name: "Rick"})-[:KNOWS*2..3]->(remote_friend)
RETURN remote_friend.name

It will return the name property of every node that can be reached from Rick by following at least two and at most three KNOWS relationships.
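What the variable-length pattern asks for can be imitated with a short depth-limited walk in plain Python. This is a sketch of the matching idea, not of Neo4j's engine: the function name and the sample data are hypothetical, and it collects distinct end nodes, whereas a Cypher query returns one row per matching path.

```python
def variable_length_matches(relationships, start, rel_type, min_hops, max_hops):
    """Nodes reachable from `start` via min_hops..max_hops `rel_type` hops."""
    outgoing = {}
    for index, (source, kind, target) in enumerate(relationships):
        if kind == rel_type:
            outgoing.setdefault(source, []).append((index, target))

    found = set()

    def walk(node, used, hops):
        if hops >= min_hops:
            found.add(node)
        if hops == max_hops:
            return
        for index, target in outgoing.get(node, []):
            if index not in used:     # never reuse a relationship within one path
                walk(target, used | {index}, hops + 1)

    walk(start, frozenset(), 0)
    return found


# Hypothetical data: a chain of acquaintances plus one unrelated relationship.
data = [
    ("Rick", "KNOWS", "Ann"),
    ("Ann", "KNOWS", "Ben"),
    ("Ben", "KNOWS", "Cat"),
    ("Cat", "KNOWS", "Dan"),
    ("Rick", "LIKES", "Eve"),
]

print(sorted(variable_length_matches(data, "Rick", "KNOWS", 2, 3)))
# prints ['Ben', 'Cat']
```

Ann is one hop away and Dan is four, so both fall outside the 2..3 range; Eve is excluded because LIKES is not the requested relationship type.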

Cypher also includes commands for writing data. For example, the following could be used to add a node to a movie database. The node created carries the label Movie and has its title property set to "Shrek."

CREATE (movie:Movie {title: "Shrek"})

MERGE, a combination of MATCH and CREATE, enables atomic updates to the database. If the pattern specified by the MERGE already exists, the clause returns that pattern. Otherwise, it adds the pattern to the database. In Neo4j parlance, this is described as the "put if absent" capability, and it handles situations where two or more clients might attempt to write the same unique entity. Only one client will succeed. The other clients will block until the transaction of the winning client completes, at which time the “losers” are passed a reference to the entity that the winner created.
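The "put if absent" behavior can be imitated in plain Python with a lock standing in for the winner's transaction. This is an analogy only, with hypothetical class and method names; it does not show how Neo4j implements MERGE.

```python
import threading


class MergeStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._nodes = {}      # (label, property name, value) -> node
        self.created = 0

    def merge(self, label, prop, value):
        """Return the matching node, creating it only if it is absent."""
        with self._lock:      # late arrivals block here until the winner is done
            key = (label, prop, value)
            node = self._nodes.get(key)
            if node is None:
                node = {"label": label, prop: value}
                self._nodes[key] = node
                self.created += 1
            return node


store = MergeStore()
results = []


def client():
    results.append(store.merge("Movie", "title", "Shrek"))


threads = [threading.Thread(target=client) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(store.created, all(node is results[0] for node in results))
# prints 1 True
```

Eight clients ask for the same movie, one of them creates it, and the other seven receive a reference to that same node.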

Other Cypher commands that are more SQL-like are ORDER BY, which lets you sort the output of a query (though it only applies to properties), and WHERE, which lets you specify patterns to filter the results of a MATCH.

Neo4j console

The Neo4j server presents a GUI console that can be manipulated from a browser. From the console, you can enter Cypher commands, view database status and statistics, and more. It even provides an interactive graphical display of the database itself.

Neo4j clusters

The server form of Neo4j boasts several characteristics that support scaling to large numbers of clients, as well as allowing you to build high-availability databases. Though you can run Neo4j in a cluster, data is not distributed among the cluster’s members as in other clustered databases. Data in a graph database is highly connected, and therefore does not lend itself to easy partitioning. Nevertheless, the Neo4j engineers are working on a future version that will support data partitioning.

Currently, a clustered Neo4j installation consists of a single read/write master and multiple read-only slaves. The database is completely mirrored on every cluster member. Read throughput is high, as any slave can service a read request. However, write operations are processed only by the master, which effectively serializes updates to the database.

Such a clustered system provides high availability in that a failed slave can quickly be restarted and brought back online. Or because all slaves are identical, a new slave could be quickly cloned from any existing slave and inserted into the cluster to replace the failed member. Finally, should the master fail, the slaves will elect a replacement from among their ranks.

Neo4j works to improve data throughput via a multilevel caching scheme. The upper level, devoted to read operations, caches logical entities such as nodes and relationships. The lower level, devoted to write operations, caches physical disk blocks.

Should you need to quickly add a large amount of data to a Neo4j database -- say, import the contents of an RDBMS -- you can use the batch inserter, which operates directly on the database files, so it has no transactional protection. Consequently, it’s fast, but it can be used only when it is the only write thread accessing the database.

When you install a Neo4j server, it creates a GUI management console on port 7474. Point a browser there, and you can open panels to retrieve database statistics, enter Cypher commands and REST API commands, and even view a D3-generated graphical representation of your database.

If you prefer command-line access, the installation also provides a Neo4j shell. From the shell, you can execute Cypher queries or fetch metadata about your database. You can even browse your database using commands that mimic Linux shell commands. To traverse to node A, enter cd A. To see your current location in your database, enter pwd. To list relationships emanating from the current node, enter ls.

Neo4j’s online documentation is excellent, and it includes numerous embedded interactive tutorials. You can exercise Cypher code in your browser, and the results are displayed graphically using the popular D3 visualization library. You can even manipulate the displayed force-directed graphs live.

A single server instance of the Neo4j community edition is easy to install, and -- conceptually -- the basics of a Neo4j database are easy to grasp. Finding your sea legs is faster than with other NoSQL databases. Cypher requires practice and experimentation, but the tutorials are there to help. In addition, you’ll find a healthy and growing gallery of example applications online.

When it comes to tracking relationships, Neo4j puts so-called relational databases to shame. A native graph database management system, Neo4j is optimized for storing and processing large numbers of connections among large numbers of entities. Further, Neo4j offers the schema flexibility we've come to expect from NoSQL databases, allowing the data set to be extended with new kinds of entities and relationships at any time.

If you find your current data doesn’t quite fit the database you’re using, perhaps a graph database is the answer. The best way to discover if that’s true is to download a copy of Neo4j and get graphing.

InfoWorld Scorecard: Neo4j 2.1.3
  • Ease of use (20%): 8
  • Scalability (20%): 7
  • Documentation (15%): 9
  • Installation and setup (15%): 9
  • Value (10%): 9
  • Administration (20%): 0
  • Overall Score: 6.6
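The overall score appears to be the weighted average of the category scores. That reading is an assumption, though the arithmetic below reproduces the published figure from the weights and scores listed in the scorecard.

```python
# Category weights (percent) and scores as listed in the scorecard.
weights = {
    "Ease of use": 20,
    "Scalability": 20,
    "Documentation": 15,
    "Installation and setup": 15,
    "Value": 10,
    "Administration": 20,
}
scores = {
    "Ease of use": 8,
    "Scalability": 7,
    "Documentation": 9,
    "Installation and setup": 9,
    "Value": 9,
    "Administration": 0,
}

# Weighted average: sum of weight * score, divided by the total weight.
overall = sum(weights[c] * scores[c] for c in weights) / sum(weights.values())
print(round(overall, 1))  # prints 6.6
```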
At a Glance
  • When the focus of the application is on relationships, associations, or connections, a graph database may be the best way to store the data. Neo4j is a free, easy, and capable option.

    Pros

    • ACID transactions
    • Small footprint supports use as an embedded database
    • Excellent interactive documentation

    Cons

    • Not fully distributed (though data is replicated for reads)
    • Query language can be challenging for RDBMS users

Copyright © 2014 IDG Communications, Inc.
