whereby rows of a database table are held separately, rather than being split by columns (as in normalization). Each partition forms part of a shard, which may in turn be located on a separate database server or in a separate physical location.
That is, sharding duplicates a database model (or schema) across multiple database servers (this is different from traditional partitioning, which is row based and usually stays on the same server). Consequently, you must then decide how to divide the data across shards; for example, as Wikipedia suggests, one may shard by customer — European customers live in shard one, US-based customers in shard two, and so on (see the sketch after the list below). The obvious benefits are:
- increased read performance (simply put, there is less data to scan in a table)
- increased reliability (that is, if shard one above goes down, US-based customers in shard two are conceivably unaffected)
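To make that concrete, here's a minimal sketch (in Java, assuming a hypothetical ShardRouter class and made-up connection URLs) of what routing by customer region might look like at the application level:

```java
import java.util.Map;

// Hypothetical application-level shard router: the database itself has no
// notion of shards, so the application decides which server to talk to.
public class ShardRouter {

    // Illustrative mapping of customer region to a shard's JDBC URL.
    private static final Map<String, String> SHARD_URLS = Map.of(
            "EU", "jdbc:postgresql://shard1.example.com/customers",
            "US", "jdbc:postgresql://shard2.example.com/customers");

    // Returns the connection URL for the shard that owns this customer's rows.
    public String urlFor(String region) {
        String url = SHARD_URLS.get(region);
        if (url == null) {
            throw new IllegalArgumentException("No shard configured for region " + region);
        }
        return url;
    }
}
```

Every read and write now passes through a step like this one; the routing knowledge lives in your code, not in the database.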
From a scalability standpoint, which has certainly become quite in vogue in the age of Aquarius with applications like Twitter, Facebook, and Flickr, sharding appears to be an easily justifiable approach. Yet, before you decide to shard, you'd be wise to consider just what you're getting yourself into.
Most sharding implementations are at the application level — that is, the database itself doesn't know you've decided to shard its data. Accordingly, primary keys can become problematic. If you leave key generation to the database (i.e., sequences), then the possibility of two shards producing the same primary key is real. Thus, you'll most likely need to move primary key generation into the application. This isn't such a bad thing and probably isn't an issue for most apps. Unfortunately, keys are the easy issue to solve.
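By way of example, here's one minimal sketch of application-level key generation using UUIDs; it's just one plausible approach, not a requirement of any particular sharding library:

```java
import java.util.UUID;

// Application-generated primary keys: because each shard's database could
// hand out the same sequence value, the application creates globally unique
// identifiers itself before inserting the row.
public class KeyGenerator {

    // A random UUID is unique across shards without any coordination.
    public static String newCustomerId() {
        return UUID.randomUUID().toString();
    }
}
```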
Sharding works best when shards act in isolation. For instance, in keeping with Wikipedia’s example, so long as US customers don’t, in any way, relate to European customers, everything is hip. If, however, these two entities require joining, sharding becomes a dilemma: querying across shards is not easy. In fact, you must handle joining shards at the application level (because as stated earlier, the database doesn’t know about shards). And you had better have already decided to generate cross-shard unique keys at that level too.
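To give a rough sense of what joining at the application level entails, the sketch below (plain JDBC; the table and column names are made up) runs one query per shard and merges the rows in memory:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

// A "join" across shards done by hand: the database cannot do it for us,
// so we run one query per shard and merge the rows in application code.
public class CrossShardQuery {

    public List<String> customerNamesInBothRegions(String euUrl, String usUrl) throws Exception {
        List<String> names = new ArrayList<>();
        for (String url : List.of(euUrl, usUrl)) {
            try (Connection conn = DriverManager.getConnection(url);
                 PreparedStatement stmt = conn.prepareStatement("SELECT name FROM customer");
                 ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString("name"));
                }
            }
        }
        return names; // any matching, sorting, or real join logic happens here, not in SQL
    }
}
```

Any matching, sorting, or actual join logic now has to be written and maintained by hand.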
As another example to further demonstrate this headache, the Grails sharding plugin is demonstrated with a simple application that stores users and comments. This problem domain works well so long as nothing ever changes; however, what if, in the future, this fictitious application needs to provide thread-able comments (so as to form a conversation)? That is, comments would need parents (i.e., comment #24 is in reply to comment #2).
One way of easily modeling this requirement is to store the parent comment's id on the child comment (i.e., a property on comment), yet this application is already broken, because, from the start, ids are "not unique across shards." Thus, if this were a real-life application (I know it's not and is used simply for demonstration purposes), a decision to shard early on has already locked the application into a data storage model that requires a lot of heavy lifting to evolve. Think about it: because ids aren't guaranteed to be unique across shards, what happens if a customer in shard one leaves a comment in reply to a comment from a customer in shard two? How would the application efficiently link (i.e., join) the two on either a read or a write?
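One plausible workaround, sketched here with a hypothetical Comment model that has nothing to do with the plugin's actual code, is to carry a shard-qualified reference to the parent so the application knows where to look:

```java
// Hypothetical domain object: because a bare id is not unique across shards,
// a reply has to remember which shard its parent lives on as well.
public class Comment {
    long id;               // unique only within this comment's own shard
    int shardId;           // shard that owns this comment
    Long parentId;         // id of the parent comment, if this is a reply
    Integer parentShardId; // shard of the parent; may differ from shardId

    // Reading a conversation now means one extra lookup per cross-shard hop,
    // which is exactly the kind of work the database used to do in a join.
}
```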
I don't dispute that the issue above can be solved; however, the resolution creates artificial complexity (which always yields higher defect rates). Most importantly, chances are that any solution to this problem will negate benefit #1 (reads are supposed to be fast!).
Lastly, sharding creates additional work when it comes to data management. Operational maintenance of shards is a lot of work, especially if shards require bifurcation (that is, what if your US customer shard grows so big that you need to break it into east and west sub-shards?) and/or migrations.
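To see why bifurcation hurts, compare a before-and-after routing rule; the class name, state list, and shard numbers below are purely illustrative:

```java
import java.util.Set;

// Illustrative routing rules before and after splitting the US shard.
public class UsShardSplit {

    // Abbreviated, hypothetical list of "east" states, for illustration only.
    private static final Set<String> EASTERN_STATES = Set.of("NY", "MA", "FL", "GA");

    // Original rule: EU customers on shard 1, every US customer on shard 2.
    static int oldShardFor(String region) {
        return "EU".equals(region) ? 1 : 2;
    }

    // New rule: US customers split east/west; EU routing is unchanged.
    static int newShardFor(String region, String usState) {
        if ("EU".equals(region)) {
            return 1;
        }
        return EASTERN_STATES.contains(usState) ? 2 : 3;
    }
    // Every row whose old and new shard numbers differ has to be migrated,
    // and the application must keep routing consistently while that happens.
}
```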
This isn't to say sharding is nefarious. There are scenarios where sharding can improve application read and write performance; what's more, sharding is certainly in use in varying domains. But, as my friend Tim Berglund dared to muse:
relational sharding is a smell
In plain English: sharding without considering the long-term consequences is dangerous. Think twice before sharding, especially if you're considering it before all else. In fact, if you do find yourself considering sharding at the outset, then perhaps you should be looking at a NoSQL alternative to the relational model. Because it's their bag, both Twitter and Facebook did.
This story, "Think twice before sharding" was originally published by JavaWorld.