Open source Hive: Large-scale, distributed data processing made easy
Thank heaven for Hive, a data analysis and query front end for Hadoop that makes Hadoop data files look like SQL tablesFollow @infoworld
More HQL tricks
HQL is designed to be easily mastered by anyone already familiar with SQL. Though HQL is definitely a subset of SQL, it provides a surprising amount of SQL-like functionality. Hive's DDL includes commands for creating and dropping tables as well as altering table structure (adding or replacing columns). Tables can also be created with partition specifiers, which -- if strategically arranged -- can accelerate some queries. HQL's SELECT clause supports subqueries, as well as GROUP BY and SORT BY clauses. Also, you can perform multiple JOIN operations in an HQL query (though only the equality operator can be used in the JOIN conditional).
Other HQL language features have no direct SQL counterpart, but are understandable deviations when you consider HQL's raison d'etre. For example, if you already have a large table imported into Hive and want to test a query you've just written, but would rather not wait the hour you suspect the query will take, you can use Hive's TABLESAMPLE clause. Applied in conjunction with the CREATE command's CLUSTERED BY clause, adding the TABLESAMPLE clause to a query's FROM clause will involve only a subset of the entire table's data in that query, thereby reducing query execution time significantly.
[ Stay up to date on the latest open source developments with InfoWorld's Technology: Open Source newsletter. ]
Finally, if you want to add a new, user-defined function to HQL, Hive provides a plug-in mechanism whereby you can write your function (it will have to be in Java), compile it into a JAR file, and register it with the Hive infrastructure. Restart Hive, and your function is ready to use in your Hive queries.
Join the Hive
Hive is easy to install, and HQL is easy to pick up if you already know even a modest amount of SQL. And Hive has a bright future; the road map of upcoming features includes more support for languages other than Java, a HAVING clause, improvements to Hive's JOIN capabilities, additional data types, indexes, and much more.
Hive, however, is not a replacement for an RDBMS. As already mentioned, Hive does not support random row insertion or deletion. The Hive Web site makes it clear that Hive is a tool for the analysis and summarization of large datasets; it is not meant for structured, randomly accessed content storage.
Hadoop is emerging as the current darling of the cloud computing crowd, and Hive certainly assists that ascent. Creating Hadoop map/reduce tasks demands programming skills that Hive does not require (though some map/reduce jobs will always necessitate hand-coding). Still, Hive is an ideal express-entry into the large-scale distributed data processing world of Hadoop. All the ease of SQL with all the power of Hadoop -- sounds good to me.
Thanks to Facebook engineers Joydeep Sen Sarma and Ashish Thusoo for their assistance with this article.
- First look: Amazon brings MapReduce to the Elastic Cloud
- Inside Amazon Web Services
- Hooking your apps into Amazon Web Services
- Slacker databases break all the old rules
- Cloud control systems tame the ether
- Busting the nine myths of cloud computing
- What to do if your cloud provider disappears
- Cloud options for IT that IT will love
- What cloud computing really means
Read more about cloud computing in InfoWorld's Cloud Computing Channel.