InfoWorld review: Databases primed for social networks
Neo4j, Cassandra, and FluidDB represent a breed of databases that swiftly search social networking data
All in all, Neo4j is an exciting tool that's just starting to be really useful. The fun comes when you start imagining what all of the other graph algorithms can do. There are implementations of the shortest path algorithms that would help a genealogist, a forensic accountant, and many others playing with social networks. These are just the beginning, and I expect there will be a bit of a renaissance as Website developers start unlocking some of the more arcane graph algorithms developed by computer scientists over the years.
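To make the shortest-path idea concrete, here is a minimal sketch in plain Python. It does not use Neo4j's API; the `friends` dictionary and the `shortest_path` function are hypothetical names invented for illustration, and the graph is assumed to be small enough to fit in memory. It uses breadth-first search, the simplest shortest-path method for an unweighted graph.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Return the shortest chain of people from start to goal, or None."""
    if start == goal:
        return [start]
    previous = {start: None}      # who we arrived from, for each visited person
    queue = deque([start])
    while queue:
        person = queue.popleft()
        for friend in graph.get(person, ()):
            if friend in previous:
                continue          # already reached by a path at least as short
            previous[friend] = person
            if friend == goal:
                # Walk backward from the goal to rebuild the path.
                path = [goal]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(friend)
    return None

# A made-up friendship graph, stored as adjacency lists.
friends = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}

print(shortest_path(friends, "alice", "erin"))
```

A graph database earns its keep when the graph is too large or too deep for this kind of in-memory walk, but the question being asked is the same.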
Not all of the queries on a social network require the sophistication of a tool such as Neo4j, because not all searches require deep trips through the graph. Many of the simplest involve intersections and unions of the information attached to various nodes.
Digg, for instance, wanted a symbol to appear beside a link each time it was "dugg" by a user. This simple intersection, however, is complicated by the huge mass of information flowing through Digg, thus making conventional approaches with JOINs of relational tables too slow, even with good indexing.
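The query-time version of that intersection can be sketched in a few lines of Python. This is an illustration only, not Digg's code: the `following` and `diggs` dictionaries and the `diggers_i_follow` function are hypothetical, and the data is assumed to sit in memory rather than in relational tables.

```python
def diggers_i_follow(following, diggs, viewer, link):
    """Return the accounts the viewer follows who have dugg the link."""
    # Set intersection plays the role of the JOIN between the two tables.
    return following.get(viewer, set()) & diggs.get(link, set())

# Made-up sample data.
following = {"alice": {"bob", "carol", "dave"}}
diggs = {"example.com/story": {"bob", "dave", "erin"}}

print(sorted(diggers_i_follow(following, diggs, "alice", "example.com/story")))
```

The operation is trivial on a small data set. The cost comes from repeating it for every link on every page view across millions of users.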
Digg's solution has been to use a more write-friendly (nontransactional) environment to write out multiple versions of the data. Instead of computing the intersection at query time with a JOIN, it just precomputes the information for all but the most loved pages. The moment a person "diggs" a link, the denormalization process begins: The app precomputes the JOIN by inserting a mention into the lists of all of the followers of that user, effectively shifting the computational load. That means if someone with 10,000 followers likes a link, there will be 10,000 different entries written at that time.
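The write-time approach described above might look something like the following Python sketch. It is a toy model under stated assumptions, not Digg's implementation and not the Cassandra API: the `DiggFeed` class, its method names, and the in-memory dictionaries standing in for the database are all hypothetical.

```python
from collections import defaultdict

class DiggFeed:
    """Toy model of precomputing the JOIN at write time."""

    def __init__(self):
        self.followers = defaultdict(set)   # user -> accounts that follow them
        # follower -> link -> set of followed users who dugg that link
        self.friend_diggs = defaultdict(lambda: defaultdict(set))
        self.writes = 0                      # counts denormalized entries written

    def follow(self, follower, followee):
        self.followers[followee].add(follower)

    def digg(self, user, link):
        # One write per follower: the cost is paid here, not at query time.
        for follower in self.followers[user]:
            self.friend_diggs[follower][link].add(user)
            self.writes += 1

    def dugg_by_friends(self, viewer, link):
        # A plain lookup; no intersection is computed when the page renders.
        return self.friend_diggs.get(viewer, {}).get(link, set())

feed = DiggFeed()
for fan in ("alice", "carol", "dave"):
    feed.follow(fan, "bob")
feed.digg("bob", "example.com/story")

print(feed.writes)
print(feed.dugg_by_friends("alice", "example.com/story"))
```

One simplification to note: in this sketch a new follower does not see diggs made before the follow, because nothing backfills the earlier entries.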
Digg uses Cassandra, a NoSQL database that promises to be "eventually consistent" -- that is, an update doesn't reach all instances immediately, which is acceptable for something as ephemeral as a link to an article. Facebook, the original developer of Cassandra, often gives me wildly inaccurate versions of my newsfeed, featuring old articles from odd times. It's not a big deal, though, because it's just friendly chatter.
[ In InfoWorld's "Slacker databases break all the old rules," Peter Wayner reviews four NoSQL databases: Amazon SimpleDB, CouchDB, Google App Engine, and Persevere. ]
If something like this happened when I accessed my bank account online, for example, I would be angry. But the lack of sophistication in adding information to the database means that it doesn't take long to add the tens of thousands of links.
I've enjoyed working with Cassandra and the other new NoSQL databases for some time. The limitations and inaccuracies are often acceptable when the data is as expendable as many of the text strings floating around social networks, especially if the higher speeds make it possible to do some of the pre-computation of JOINs.
The denormalization can also chew through disk space because the data is repeated ad nauseam, but this is less of a problem now that disk space is so cheap. Although services such as Digg and Twitter can use these kinds of techniques to speed up answer delivery, they still face the theoretical problem of quadratic growth: If everyone starts following everyone else on Twitter, every post must be copied to every other user, and the write load becomes a disaster.
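A quick back-of-the-envelope calculation shows how fast the worst case grows. The sketch below assumes the simplest possible model, in which each user posts once and each post costs one write per follower; the function names are hypothetical.

```python
def fanout_writes(follower_counts):
    """Total entries written if every user posts exactly once."""
    return sum(follower_counts)

def everyone_follows_everyone(n):
    """Follower counts for n users who all follow each other."""
    return [n - 1] * n

for n in (10, 1_000, 100_000):
    print(n, fanout_writes(everyone_follows_everyone(n)))
```

With n users the total is n * (n - 1) writes, so multiplying the user count by 10 multiplies the write load by roughly 100.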
Cassandra is an excellent tool, and there are a number of similar low-rent databases that can support this kind of approach. MongoDB and CouchDB are also popular.