This week on the New Tech Forum, we're taking a look at the challenges of traditional storage and compute in the world of big data -- and the growing role of object storage and integrated compute resources.
Jason Hoffman, CTO and founder of cloud service provider Joyent, details how combining object storage and parallel compute clusters can make working with big data easier and faster by eliminating bottlenecks.
How objects and compute will eat the world
Networked storage vendors' days are numbered. Customers are fleeing to consolidated online object storage, and digital object storage will soon surpass traditional file storage as the primary model for data outside of a DBMS. But there's a subtle and often unappreciated downside to most distributed object storage: data inertia. The implicit limits on moving huge data sets to in-network compute nodes keep business and clinical insights from surfacing.
At Joyent, we architected the Manta Storage and Compute Service -- "Manta," for short -- to be both a best-in-class object store and an in-storage massively parallel compute cluster. It drives data latency effectively to zero, turning weekly or monthly jobs into an hourly or even on-demand analytic cadence.
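The core idea can be sketched in a toy model: the map step runs where the objects live, and only small intermediate results cross the network to the reduce step. This is an illustrative sketch only, not Manta's actual API; all names here are hypothetical.

```python
# Toy model of in-storage compute: pretend each storage node holds
# some log objects locally (hypothetical data, for illustration).
storage_nodes = {
    "node-a": ["GET /a 200", "GET /b 500", "GET /c 200"],
    "node-b": ["GET /d 500", "GET /e 200"],
}

def map_phase(lines):
    # Runs on the storage node itself: count HTTP 500s in local objects.
    return sum(1 for line in lines if line.endswith("500"))

def reduce_phase(partials):
    # Only the tiny per-node counts cross the network, not the raw logs.
    return sum(partials)

partial_counts = [map_phase(lines) for lines in storage_nodes.values()]
total_errors = reduce_phase(partial_counts)
print(total_errors)  # 2
```

The payoff is in what moves: a handful of integers instead of petabytes of raw log data.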
Whence big data?
Massive data volumes arise from machines (log files, API calls), digitized nature (DNA sequences, video, audio, environmental sensors), and the humanity of billions of people online (Facebook, Baidu, e-commerce). Take a mere 10 million patients' genomes, for example. That requires 20 exabytes (EB) of storage. Then there's camera phone resolution and market penetration, both of which have been growing exponentially. And according to Digital Marketing Ramblings, Twitter distributes 400 million updates per day to 500 million subscribers.
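The genome figure is easy to sanity-check, assuming roughly 2TB of raw sequence data per patient -- an assumption chosen here to match the 20EB total quoted above:

```python
# Back-of-the-envelope check of the 10-million-genome figure,
# assuming ~2 TB of sequence data per patient (decimal units).
TB = 10 ** 12  # terabyte
EB = 10 ** 18  # exabyte

patients = 10_000_000
per_patient_bytes = 2 * TB

total_eb = patients * per_patient_bytes / EB
print(total_eb)  # 20.0
```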
But in 2012, all enterprise storage vendors shipped just 16EB of capacity.
With the big data wherewithal to capture it all, we could be at the early stages of a deeply disruptive wave of innovation. This sweeping crush threatens business models and technical architectures that assumed a paucity of data and scarcity of places to put it.
An additional hidden cost of networked object storage is the implicit inertia of petabytes of recorded audio or e-commerce server logs: that data must be moved from its resting place to a computational node. Computation is necessary to glean the business insights, social relevancy, and clinical results that make saving digital ephemera worthwhile. Even a theoretical 1Gbps or 40Gbps network severely limits the class of algorithms that can be considered and the rate at which they can be applied.
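Some rough arithmetic makes the bottleneck concrete. Moving a single petabyte across a fully utilized link, ignoring protocol overhead:

```python
# Time to move 1 PB over a network link, ignoring protocol overhead.
PB_BITS = 10 ** 15 * 8  # one petabyte in bits (decimal)

def transfer_days(link_gbps):
    seconds = PB_BITS / (link_gbps * 10 ** 9)
    return seconds / 86_400  # seconds per day

print(round(transfer_days(1), 1))   # 92.6 days at 1Gbps
print(round(transfer_days(40), 1))  # 2.3 days at 40Gbps
```

At those rates, any algorithm that needs the raw data on the compute side is effectively off the table; the practical alternative is to send the computation to the data.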