InfoWorld review: Databases primed for social networks
Neo4j, Cassandra, and FluidDB represent a breed of databases that swiftly search social networking data
FluidDB is not a graph database, but it can still handle queries on social networks, thanks to a simple structure and a radical amount of openness. The database is designed to let the world cooperate on tagging data elements; this collaborative work can produce answers to questions.
The FluidDB structure is beguiling. Anyone can attach tags to any data object, but only people with the right roles can see and search these tags. If tags are added with a consistent structure, then boolean operations on these tags can produce accurate solutions to many of the questions that might occur on a social network.
Imagine that my Twitter reader would inject "Peter Wayner follows" tags pointing to all of the people that I follow. At the same time, other people's readers might inject similar tags pointing to the people they follow. Then FluidDB could answer questions such as, "Show me everyone followed by two specific people" or, for that matter, any other boolean operation on these sets. The advantage of FluidDB is that everyone's reader is setting tags independently, yet everyone can search all of the tags. The "graph" is built up piece by piece by individuals, but the answers that come from intersections are open to all.
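The set logic behind that kind of question can be sketched in a few lines of Python. This is only an in-memory stand-in, not FluidDB's actual API or query syntax; the tag names ("peter/follows", "mary/follows") and the `attach` helper are made up for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for a shared tag store: tag name -> objects carrying it.
# This models the idea only; it does not use FluidDB's real interface.
tags = defaultdict(set)

def attach(tag, obj):
    """Attach a tag to an object, as any user's reader might do independently."""
    tags[tag].add(obj)

# Two readers write their own tags without coordinating with each other.
for person in ("alice", "bob", "carol"):
    attach("peter/follows", person)
for person in ("bob", "carol", "dave"):
    attach("mary/follows", person)

# Boolean operations over the tagged sets answer the social questions.
both = tags["peter/follows"] & tags["mary/follows"]        # followed by both
either = tags["peter/follows"] | tags["mary/follows"]      # followed by either
only_peter = tags["peter/follows"] - tags["mary/follows"]  # followed by Peter alone

print(sorted(both), sorted(either), sorted(only_peter))
```

Each reader only writes its own tag, yet anyone who can read both tags can compute the intersection.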
There are limits to this power. The queries can work on only one layer at a time, just like the Digg example using Project Cassandra. The query can't search through several layers without repeatedly asking questions of the database, refining the answer, and then sending another query. If Digg wants to put a special icon next to posts from friends of friends, a full graph database such as Neo4j would be required.
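To make the one-layer limit concrete, here is a rough sketch of how a client would have to assemble a friends-of-friends answer from single-layer lookups. The `follows` dictionary and the `query` helper are hypothetical stand-ins for a store that can only answer "whom does this person follow?"; nothing here is a real FluidDB or Cassandra call.

```python
def friends_of_friends(follows, person):
    """Build a two-layer answer from one-layer queries, counting round trips."""
    queries = 0

    def query(who):
        # Stand-in for a single one-layer request to the database.
        nonlocal queries
        queries += 1
        return set(follows.get(who, ()))

    direct = query(person)          # first round trip
    second_layer = set()
    for friend in sorted(direct):   # one more round trip per direct friend
        second_layer |= query(friend)

    # Drop the starting person and anyone already followed directly.
    return second_layer - direct - {person}, queries

# Made-up sample data for illustration.
follows = {
    "peter": {"alice", "bob"},
    "alice": {"carol", "peter"},
    "bob": {"carol", "dave", "alice"},
}

result, round_trips = friends_of_friends(follows, "peter")
print(sorted(result), round_trips)
```

The number of round trips grows with the number of direct friends, which is the cost a graph database is meant to absorb inside a single traversal.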
The lack of structure of the tags, though, means it's possible to essentially pre-compute some of the most complicated queries. If someone stops following a person on Twitter, the reader might add a "stopped following" tag, thus saving the trouble of subtracting or intersecting lists.
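A minimal sketch of that pre-computation idea, again using a plain Python dictionary rather than FluidDB itself: the reader updates a second, hypothetical "stopped-following" tag at the moment of the unfollow, so the question "whom did Peter drop?" becomes a direct lookup. The tag names and the `follow`/`unfollow` helpers are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical tag store: tag name -> set of tagged people.
tags = defaultdict(set)

def follow(user, target):
    """Record a follow and clear any earlier 'stopped following' mark."""
    tags[f"{user}/follows"].add(target)
    tags[f"{user}/stopped-following"].discard(target)

def unfollow(user, target):
    """Record the unfollow as its own tag instead of leaving it implicit."""
    tags[f"{user}/follows"].discard(target)
    tags[f"{user}/stopped-following"].add(target)

follow("peter", "alice")
follow("peter", "bob")
unfollow("peter", "bob")

# No need to diff old and new follow lists; the answer is already tagged.
dropped = tags["peter/stopped-following"]
current = tags["peter/follows"]
print(sorted(dropped), sorted(current))
```

The trade-off is extra writes on every change in exchange for a cheaper read later.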
The interest in public/private databases such as FluidDB is just beginning, and the version I tested is merely an alpha. However, I imagine that the structure of the tags will develop organically, much in the same way that Twitter users started coming up with hash tags. The more I use Twitter these days, the more I wish it had an open and flexible structure along the lines of FluidDB.
Room to grow
All three of these solutions are just the first cut at tools that can answer questions about social networks fast enough to satisfy the needs of the people who have to know what their friends' friends are doing. They can't quite perform more complex tasks, however, such as computing sums and averages over result sets, at least not with a built-in command. You can implement many of these on your own.
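Doing the arithmetic yourself usually means pulling the result set into application code and aggregating there. A small sketch, assuming the query has already returned a list of records; the field names and the numbers are invented for the example:

```python
from statistics import mean

# Hypothetical result set, as it might come back from any of the three stores.
result_set = [
    {"name": "alice", "followers": 120},
    {"name": "bob", "followers": 80},
    {"name": "carol", "followers": 100},
]

# Aggregate on the client, since the database offers no built-in command.
total_followers = sum(row["followers"] for row in result_set)
average_followers = mean(row["followers"] for row in result_set)

print(total_followers, average_followers)
```

This works for small result sets; for large ones, every row has to travel to the client before it can be counted.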
There's plenty of room for improvement. Neo4j, for example, scans the nodes in the network, but it can't handle more complex queries; it wouldn't be able to find nodes with two and only two friends unless you start adding attributes and other features yourself. All it can do is scan the entire graph. That's why the database needs a more flexible indexing mechanism for finding nodes quickly, not just the text in nodes that's indexed by Lucene.
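The difference between scanning and indexing can be illustrated with a toy graph in plain Python. This is not Neo4j's API; the adjacency dictionary and both helper functions are hypothetical, and the index shown is the kind of attribute you would have to maintain yourself.

```python
from collections import defaultdict

# Toy friendship graph: node -> set of friends (made-up data).
graph = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"a", "d"},
    "d": {"c"},
}

def scan_for_degree(graph, k):
    """Full scan: visit every node and test its friend count."""
    return {node for node, friends in graph.items() if len(friends) == k}

def build_degree_index(graph):
    """Hand-maintained index: friend count -> nodes with that count."""
    index = defaultdict(set)
    for node, friends in graph.items():
        index[len(friends)].add(node)
    return index

degree_index = build_degree_index(graph)

by_scan = scan_for_degree(graph, 2)   # touches every node on every query
by_index = degree_index[2]            # a single lookup once the index exists
print(sorted(by_scan), sorted(by_index))
```

The index has to be updated whenever a friendship is added or removed, which is the bookkeeping a built-in indexing mechanism would take over.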
These improvements will probably be coming along soon, though I'm not sure whether the result will be simple and straightforward. Neo4j includes a number of plug-ins, and it would be simple to add some flexible indexing routines. The open source nature of the code means you can revise and extend it even if you buy a commercial license.