What is object storage?
Objects differ from files in one key respect: they are immutable once written. They can be updated and overwritten only in their entirety; the in-place update of a POSIX or similar file system is verboten. In practice this is not a severe constraint, especially given that most object data is itself immutable. DNA rarely mutates, log tampering is bad, cat videos are remixed and republished. Besides, mutable data winds up in a richly indexed DBMS, which may itself reside on an object or file store.
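To make the distinction concrete, here is a minimal sketch of the two data models (the class and method names are hypothetical, not any vendor's API): an object store exposes whole-object put and get, while a POSIX file permits seek-and-overwrite in place.

```python
class ObjectStore:
    """Hypothetical object store: whole objects only, no in-place writes."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        # An "update" supersedes the entire object; there is no way
        # to rewrite a byte range inside an existing object.
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._objects[key]


# POSIX-style in-place update: legal on a file, verboten on an object.
with open("/tmp/posix_file", "w+b") as f:
    f.write(b"hello world")
    f.seek(6)
    f.write(b"manta")  # rewrites bytes 6..10 in place

store = ObjectStore()
store.put("videos/cat.mp4", b"hello world")
# To "edit" an object, you put a whole new one under the same key:
store.put("videos/cat.mp4", b"hello manta")
```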
Enforcing immutability yields architectural simplicity. A massively distributed storage system spread across networked nodes must contend with network congestion and component failures. A technically seductive approach is to distribute a single object's data across nodes, with erasure-coded partial redundancy to survive one or more node failures at reasonable economic cost. The limits of physics and metaphysics are captured in Brewer's CAP theorem and in refinements such as Abadi's PACELC.
These acronyms themselves express the trade-offs inherent in a distributed architecture: "CAP" stands for "consistency or availability under network partition," and "PACELC" for "under partition, availability or consistency; else, latency or consistency." Without going into the math, data writes orchestrated at the grain of an entire object are either synchronous (consistency first) or asynchronous (availability first).
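A minimal sketch of that choice, with hypothetical replica nodes (the names and structure are illustrative, not Manta's or any vendor's protocol): the synchronous write acknowledges only after every replica holds the object, while the asynchronous write acknowledges after the first durable copy and lets the rest converge in the background.

```python
from concurrent.futures import ThreadPoolExecutor


class Replica:
    """Hypothetical storage node holding whole-object copies."""

    def __init__(self, name: str):
        self.name = name
        self.objects = {}

    def store(self, key: str, data: bytes) -> None:
        self.objects[key] = data


def write_sync(replicas, key, data):
    # Consistency first: do not acknowledge until all replicas agree.
    for r in replicas:
        r.store(key, data)
    return "ack: all replicas consistent"


def write_async(replicas, key, data, pool):
    # Availability first: acknowledge after one durable copy; the
    # remaining replicas converge later (and reads may see stale data).
    replicas[0].store(key, data)
    for r in replicas[1:]:
        pool.submit(r.store, key, data)
    return "ack: one copy durable, others pending"


replicas = [Replica("dc1"), Replica("dc2"), Replica("dc3")]
with ThreadPoolExecutor() as pool:
    print(write_sync(replicas, "a", b"..."))
    print(write_async(replicas, "b", b"...", pool))
```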
Unwittingly, these popular redundant-array-of-independent-nodes (RAIN) architectures force an expensive reconstitution operation across the network for each object read. Furthermore, no meaningful computation can be performed at a storage node, because each node has only a fractional view, tessellated arbitrarily by the erasure code chosen. By design, this popular object storage architecture requires non-negotiable network bandwidth to operate on an object.
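The bandwidth tax is easy to quantify. Assuming a k+m erasure code striped across k+m nodes (the 10+4 parameters below are illustrative only), every read must gather k fragments over the network before the object exists anywhere as a whole:

```python
def rain_read_cost(object_size: float, k: int, m: int) -> dict:
    """Illustrative cost of reading one object stored under a k+m
    erasure code striped across k+m nodes (parameters are examples)."""
    fragment = object_size / k
    return {
        # Every read gathers k fragments over the network to
        # reconstitute the object: a full object's worth of traffic.
        "network_bytes_per_read": k * fragment,
        # Any single node holds only an arbitrary 1/k slice, so no
        # meaningful computation can run where the data rests.
        "local_view_fraction": 1 / k,
    }


# A 1 GiB object under a 10+4 code still moves ~1 GiB across the
# network on every single read; the cost never amortizes away.
print(rain_read_cost(float(1 << 30), k=10, m=4))
```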
Joyent Manta Storage and Compute Architecture
The Joyent Manta Storage Service's architecture is a departure from these storage-only approaches: instead, high-performance compute is included in each storage node. We aimed for a converged storage-and-compute design.
Rather than erasure coding across nodes, Manta uses erasure codes across the disks within a node (currently 9+2 across three stripes, with three hot-standby disks, if you must know). It also defaults to full-copy redundancy across multiple data centers to achieve data durability equal to that of RAIN systems. The economic overhang of a default two copies is surprisingly small compared with what common RAIN architectures charge for standard redundancy services, which suggests those services either carry a similar two-fold redundancy themselves or enjoy a rich gross margin.
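Back-of-the-envelope arithmetic makes the comparison concrete. The sketch below computes raw bytes consumed per usable byte; the 9+2, two-copy figures come from the text, while the 10+4 RAIN code is an illustrative stand-in, and hot-standby disks are treated as spares rather than steady-state overhead (an assumption).

```python
def raw_per_usable(data: int, parity: int, copies: int = 1) -> float:
    """Raw bytes consumed per usable byte for a data+parity stripe,
    times the number of independent full copies."""
    return copies * (data + parity) / data


# Manta-style: 9+2 erasure coding across the disks *within* a node,
# with two full copies in separate data centers (per the text).
manta = raw_per_usable(9, 2, copies=2)   # ~2.44x

# RAIN-style: one erasure code *across* nodes; 10+4 is illustrative.
rain = raw_per_usable(10, 4, copies=1)   # 1.40x

print(f"Manta ~{manta:.2f}x raw per usable byte; RAIN ~{rain:.2f}x")
```

The raw-capacity multiplier is higher for two full copies, yet the observation above concerns delivered price, not raw disk: if the priced difference is negligible, the RAIN services are either replicating more than their coding scheme implies or pricing well above cost.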