Pinterest has rolled out an open source solution to make it easier to query large data sets when working with Hadoop.
Terrapin, announced at Facebook's @Scale conference, was originally devised by Pinterest to replace the scalable Hadoop data store HBase. The idea was to provide a fast way to store large immutable data sets, whether generated by MapReduce jobs or kept in S3 or HDFS, and to run key-value queries against them.
According to the blog post describing Terrapin, Pinterest ran into serious performance bottlenecks when writing hundreds of gigabytes of data to HBase. Bulk uploading solved the performance problem but caused the inserted data to be scattered across a cluster, which hurt performance all over again.
Terrapin uses HBase's existing HFile format for data storage on top of HDFS, but provides its own routing layer that figures out which node holds the HFile data for a given key and serves the query from that node.
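The routing idea can be pictured as a two-step lookup: hash the key to a shard, then resolve which node currently serves that shard. This is a minimal sketch, not Terrapin's actual API; the function names, the shard count, and the shard-to-node table are all hypothetical.

```python
import hashlib

NUM_SHARDS = 4

# In a real system this assignment would be published by a coordination
# service; here it is a fixed table for illustration only.
SHARD_TO_NODE = {0: "node-a", 1: "node-b", 2: "node-c", 3: "node-a"}

def shard_for_key(key: bytes, num_shards: int = NUM_SHARDS) -> int:
    """Hash the key to pick the shard whose HFile partition contains it."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

def node_for_key(key: bytes) -> str:
    """Resolve the node that should serve the query for this key."""
    return SHARD_TO_NODE[shard_for_key(key)]
```

Because the data sets are immutable, the routing table only changes when a new file set is loaded, which keeps lookups cheap and deterministic.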
Data fed into Terrapin can come from a variety of places -- for example, a MapReduce job that writes directly to Terrapin, or existing data in HDFS, S3, or Hive. HFiles can also be "live swapped," with a newer data set replacing an older one on the fly.
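A "live swap" over immutable data amounts to atomically repointing readers from the old snapshot to the new one. The class below is a hypothetical illustration of that pattern, not Terrapin code:

```python
import threading

class LiveSwapStore:
    """Sketch of a live swap: readers always see one complete,
    immutable data set, and a newer set replaces it atomically."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._data = initial  # current immutable snapshot

    def get(self, key):
        # Capture the snapshot reference once, so a swap landing
        # mid-request still yields a consistent version.
        snapshot = self._data
        return snapshot.get(key)

    def swap(self, new_data):
        # Point readers at the freshly loaded data set in one step;
        # the old files can then be retired.
        with self._lock:
            self._data = new_data
```

Because each data set is write-once, there is no need to merge old and new records: the swap either serves the old version or the new one, never a mix.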
Much of the Hadoop infrastructure in use leverages MapReduce, but interest and pressure are mounting to make Spark the new centerpiece for Hadoop. The current integration between Terrapin and Spark consists largely of having Spark write HFiles, but Terrapin's storage format system is extensible. In theory, any number of Hadoop data storage formats that can be queried via keys (Parquet, for instance) could get their own connectors in time.
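The extensibility described above can be pictured as a common reader interface that each on-disk format implements, so the serving layer never cares whether it is reading an HFile or, someday, a Parquet file. The interface and class names below are hypothetical, sketched in Python for brevity:

```python
from abc import ABC, abstractmethod
from typing import Optional

class FormatReader(ABC):
    """Hypothetical pluggable-reader interface: each storage format
    answers key lookups the same way, keeping the serving layer
    format-agnostic."""

    @abstractmethod
    def get(self, key: bytes) -> Optional[bytes]:
        """Return the value for key, or None if absent."""

class DictBackedReader(FormatReader):
    """Stand-in for a real HFile or Parquet reader; actual file
    parsing is omitted to keep the sketch self-contained."""

    def __init__(self, records):
        self._records = dict(records)

    def get(self, key):
        return self._records.get(key)
```

A new format connector would then only need to implement `get` over its own file layout to plug into the same query path.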