With manual sharding, the registrations are distributed across a series of check-in booths. This works because there is a well-defined, pre-determined scheme. There are no guarantees, however, that the data or registrants will be distributed evenly or that the booths can be easily expanded without reshuffling all the registrations. Furthermore, the shutdown of a single booth (equivalent to a node failure) requires reshuffling its registrations across the other booths.
Alternatively, suppose you walk into the same conference and are given the option to check in at any open registration booth. This scenario is similar to how distributing data through autosharding works, where data is automatically hashed for even distribution across the cluster. Data can be accessed from any node because a routing algorithm directs each request to the particular node that holds the data.
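The routing step can be sketched in a few lines. This is a minimal illustration, not any particular database's implementation; the node names and key format are hypothetical, and real systems typically use a fixed partition map rather than hashing directly to nodes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def node_for_key(key: str, nodes=NODES) -> str:
    """Route a key to a node by hashing it -- the essence of autosharding."""
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return nodes[digest % len(nodes)]

# Any node (or client) running the same algorithm computes the same answer,
# so a request for this key can be routed from anywhere in the cluster.
print(node_for_key("registrant:12345"))
```

Because the function is deterministic, every participant agrees on key placement without consulting a central directory.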
Note, however, that node failures are quite common. Large scale-out architectures must operate on the assumption that several nodes being in a failed state at any given time is normal operating procedure.
If nodes in a scale-out architecture fail, a well-designed architecture can enable processing to continue across the rest of the cluster. Here, autosharding has a clear advantage; with manual sharding, replicating and moving data around during node failures -- that is, rebalancing data -- can be a significant challenge. As data sets become very large, rebalancing a manually sharded deployment carries a very high overhead, and an autosharding scheme becomes a must.
Querying large datasets
The contrast between manual and autosharding also emerges in querying large data sets. Think of the registration analogy again. With manual sharding, because the booths are partitioned by a pre-determined scheme such as last-name ranges, if you want to find all individuals at the conference with a last name beginning with "S," you would only have to go to a single check-in booth to determine those names.
With autosharding, if you wanted to find the same information, you would have to go to each booth and ask who checked in with a last name beginning with "S," because hashing scatters alphabetically adjacent registrants across all the booths. Typically, map-reduce techniques are used to accomplish this.
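The map-reduce pattern for this query can be sketched with in-memory "shards" standing in for nodes; the data and function names below are illustrative, not any specific database's API.

```python
# Three hypothetical shards; hashing has scattered last names across them.
shards = [
    {"alice@x": "Smith", "bob@x": "Jones"},
    {"carol@x": "Singh", "dan@x": "Lee"},
    {"erin@x": "Stone", "frank@x": "Kim"},
]

def map_phase(shard):
    # Each node scans only its local data for matching last names.
    return [name for name in shard.values() if name.startswith("S")]

def reduce_phase(partials):
    # A coordinator merges the per-node partial results.
    return sorted(name for part in partials for name in part)

print(reduce_phase(map_phase(s) for s in shards))
# -> ['Singh', 'Smith', 'Stone']
```

Every shard must participate in the map phase even if it holds no matches, which is exactly the "visit every booth" cost the analogy describes.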
Alternatively, the query might not be all last names beginning with "S," but rather all members of a certain group ("get all registrants from a specific organization"). This is where secondary indexing and querying on a distributed data set becomes a key challenge.
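One common answer to this challenge is to maintain a secondary index mapping a field value back to primary keys. The sketch below builds such an index across the same kind of hypothetical shards; the document shapes and field names are assumptions for illustration, and real systems must also keep this index consistent and partitioned, which is the hard part.

```python
from collections import defaultdict

# Hypothetical registrant documents distributed across shards by primary key.
shards = [
    {"r1": {"name": "Ada", "org": "Acme"}, "r2": {"name": "Bo", "org": "Globex"}},
    {"r3": {"name": "Cy", "org": "Acme"}},
]

def build_secondary_index(shards, field):
    """Map each value of `field` to the primary keys of matching documents."""
    index = defaultdict(list)
    for shard in shards:
        for key, doc in shard.items():
            index[doc[field]].append(key)
    return index

idx = build_secondary_index(shards, "org")
print(idx["Acme"])  # primary keys for all registrants from that organization
```

With the index in hand, a "get all registrants from a specific organization" query becomes a handful of direct key lookups instead of a full scatter-gather scan.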
The key direct-access patterns in a document database are accessing data through a single key and navigating to another document through a related key. For example, you might pull up a certain order with an order ID, then navigate to the customer or account data by using an account ID. Unlike in a relational database, a given object in a sharded scale-out architecture is unlikely to be scattered across multiple nodes, so no expensive join is needed just to re-create that single object.
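The order-to-account navigation described above can be sketched as two key-value fetches. The key format and document fields here are hypothetical; in a real deployment each `get` would be routed to whichever node owns the key.

```python
# Hypothetical single-key documents, each retrievable in one lookup.
documents = {
    "order::1001": {"item": "badge", "account_id": "account::42"},
    "account::42": {"name": "Pat", "email": "pat@example.com"},
}

def get(key):
    # Stand-in for a key-value fetch routed to the node that owns the key.
    return documents[key]

order = get("order::1001")
account = get(order["account_id"])  # navigate via the related key -- no join
print(account["name"])
```

Each hop is a single-key operation, so the pattern stays cheap even as the data set is spread across many nodes.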