Review: Apache Hive brings real-time queries to Hadoop

Hive's SQL-like query language and vastly improved speed on huge data sets make it the perfect partner for an enterprise data warehouse

For example, let's say you want to run a query that requires logic beyond Hive's built-in SQL functions. Without a UDF, you would have to dump a temporary table to disk, run a second tool (such as Pig or a custom Java program) for the custom logic, and possibly produce a third table in HDFS for Hive to analyze. With a UDF, your custom logic can be embedded directly into your SQL query, saving those steps and leveraging Hive's query planning and caching infrastructure. (See Hivemall for a neat example of enabling machine learning in Hive via UDFs.)

Because UDFs run as part of the Hive query and have direct access to the data, they run efficiently and eliminate intermediate steps in the pipeline. UDFs must be written in Java, so they aren't for the average SQL programmer, but they fill an important niche in the Hive toolkit. Without them, certain classes of problems would be much more difficult to solve.
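
To make that concrete, here's a minimal sketch of what using a UDF looks like from the SQL side. The jar path, the Java class name (com.example.hive.UrlDomain), and the web_logs table are hypothetical placeholders; ADD JAR and CREATE TEMPORARY FUNCTION are the standard Hive statements for registering a compiled UDF with a session.

ADD JAR /tmp/my-udfs.jar;
CREATE TEMPORARY FUNCTION url_domain AS 'com.example.hive.UrlDomain';

-- The custom logic now runs inline as part of the Hive query plan,
-- with no intermediate tables written to HDFS.
SELECT url_domain(page_url) AS domain, COUNT(*) AS hits
FROM web_logs
GROUP BY url_domain(page_url);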

Hive query performance
Hive 0.13 delivers the final phase of the Stinger initiative, a community effort to improve the performance of Hive, and there's no question the effort was successful. I tested each of the major speed features introduced between the 0.11 and 0.13 releases to see how much each added to performance. The most significant feature of 0.13 is the ability to run queries on the new Tez execution framework.
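
Switching to Tez requires no changes to the queries themselves; it's a session-level setting. In the sketch below, the sales table is a hypothetical placeholder, while hive.execution.engine is the documented Hive property.

SET hive.execution.engine=tez;
-- Subsequent queries in this session compile to Tez jobs rather than
-- chains of MapReduce jobs, avoiding intermediate writes to HDFS.
SELECT store_id, SUM(total) FROM sales GROUP BY store_id;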

In my testing, query times dropped by half when run on Tez, and on queries that could be cached, times dropped another 30 percent. On larger data sets, the speedup was even more dramatic. Using the ORC (Optimized Row Columnar) file format, a feature introduced in 0.11, reduced query times by about 15 percent. Vectorization, a Microsoft contribution to Hive 0.13, sped up queries by about 10 percent. Enabling cost-based query optimization, another new feature in 0.13, provided a 20 percent boost.
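
As a rough sketch of how those features are enabled: ORC is chosen per table, while vectorization and the cost-based optimizer are session settings. The table and columns below are hypothetical; hive.vectorized.execution.enabled is the documented Hive 0.13 property, and hive.cbo.enable is the property name Hive's documentation uses for the cost-based optimizer. Note that vectorized execution in 0.13 applies to ORC-backed tables.

SET hive.vectorized.execution.enabled=true;  -- process rows in batches (Hive 0.13)
SET hive.cbo.enable=true;                    -- cost-based query optimization

CREATE TABLE sales_orc (store_id INT, total DOUBLE)
STORED AS ORC;                               -- ORC format, available since Hive 0.11

-- The cost-based optimizer works from table statistics:
ANALYZE TABLE sales_orc COMPUTE STATISTICS;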

I should point out that these tests were run on my laptop, on small data sets, using ad hoc queries, so they don't represent a real-world measurement of the new performance capabilities in Hive 0.13. What is clear is that each of these features can be a valuable tool in a performance-tuning toolkit, especially for large-scale queries.

The amount of data to be collected and analyzed isn't going to diminish in coming years. The most commonly used metric in the enterprise data warehouse business -- the price per terabyte -- will only increase in importance. Although it's easy to calculate and understand, like most simple metrics it can lead buyers astray when considered without the complete picture. Nevertheless, nearly everyone agrees that the cost of data storage in Hive is vastly lower (think 100 times lower) than in a traditional data warehouse.

Current users of traditional data warehouses should evaluate Hadoop to store unstructured data for analysis and inclusion in the data warehouse. Using Hive, it's possible to execute petabyte-scale queries to refine and cleanse data for later incorporation into data warehouse analytics. Hadoop and Hive could also be used in the reverse scenario: to off-load data summaries that would otherwise need to be stored in the data warehouse at much greater cost. Finally, Hive could be used to run experimental analytics on unstructured data that, if they prove their worth, could be moved into the data warehouse.
