Apache Hive is a tool built on top of Hadoop for analyzing large, unstructured data sets using a SQL-like syntax, thus making Hadoop accessible to legions of existing BI and corporate analytics researchers. Developed by Facebook engineers and contributed to the Apache Foundation as an open source project, Hive is now at the forefront of big data analysis in commercial environments.
Hive, like the rest of the Hadoop ecosystem, is a fast-moving target. This review covers version 0.13, which addresses several shortcomings in previous versions. It also brings a significant speed boost to SQL-like queries across large-scale Hadoop clusters, building on new capabilities for interactive query introduced in prior releases.
[ Also on InfoWorld: Know this about Hadoop right now | Learn how Hadoop works and how you can reap its benefits: Download InfoWorld's Hadoop Deep Dive PDF. | Discover what's new in business applications with InfoWorld's Technology: Applications newsletter. ]
Hive is fundamentally an operational data store that's also suitable for analyzing large, relatively static data sets where query time is not important. Hive makes an excellent addition to an existing data warehouse, but it is not a replacement. Instead, using Hive to augment a data warehouse is a great way to leverage existing investments while keeping up with the data deluge.
A typical data warehouse includes many expensive hardware and software components such as RAID or SAN storage, optimized ETL (extract, transform, load) procedures for cleaning and inserting data, specialized connectors to ERP and other back-end systems, and schemas designed around the questions an enterprise wants to ask such as sales by geography, product, or channel. The warehouse ecosystem is optimized around bringing enriched data to the CPU to answer the classes of questions the schema was designed for.
In order to address these shortcomings, the Hive community began a program (sometimes called the "Stinger" initiative) to improve query speed, with the goal of making Hive suitable for real-time, interactive queries and exploration. These improvements were delivered in three phases in versions 0.11, 0.12, and 0.13 of Hive.
Finally, although HiveQL, the query language, is based on SQL-92, it differs from SQL in some important ways due to its running on top of Hadoop. For instance, DDL (Data Definition Language) commands need to account for the fact that tables exist in a multi-user file system that supports multiple storage formats. Nevertheless, SQL users will find the HiveQL language familiar and should not have any problems adapting to it.
Hive platform architecture
From the top down, Hive looks much like any other relational database. Users write SQL queries and submit them for processing, using either a command line tool that interacts directly with the database engine or by using third-party tools that communicate with the database via JDBC or ODBC. The picture looks something like this:
By using the JDBC and ODBC drivers, available for Mac and Windows, data workers can connect their favorite SQL client to Hive to browse, query, and create tables. For power users, there is still the original, thick client CLI that interacts directly with the Hive driver. This client is the most robust, but it requires direct access to Hadoop and therefore is most suitable for local network operations where firewalls, DNS, and network topology aren't an issue.
Working with Hive
As noted above, Hive speaks a simple, SQL-like language called HiveQL that supports data definition and manipulation statements. Any SQL user should find working with Hive a familiar experience. HiveQL was designed to ease the transition from SQL and to get data analysts up and running on Hadoop right away.
Most BI and SQL developer tools can connect to Hive as easily as to any other database. Using the ODBC connector, users can import data and use tools like PowerPivot for Excel to explore and analyze data, making big data accessible across the organization.
There are a few significant differences in HiveQL and standard SQL. Hive 0.13 was designed to perform full-table scans across petabyte-scale data sets using the YARN and Tez infrastructure, so some features normally found in a relational database aren't available to the Hive user. These include transactions, cursors, prepared statements, row-level updates and deletes, and the ability to cancel a running query.
The absence of these features won't significantly affect data analysis, but it might affect your ability to use existing SQL queries on a Hive cluster. Queries might need to be written a bit differently than you would for an engine that supports the full SQL language, but a seasoned user of traditional databases should have no trouble writing Hive queries. Many traditional SQL editing environments now support Hive via connectors, and Hive tables can be accessed from many SQL editors, including those from Oracle and Microsoft.
One major difference for a database user is the need to be aware of storage details. In a traditional database environment, the database engine controls all reads and writes to the database. In Hive, the database tables are stored as files in the Hadoop Distributed File System (HDFS), where other applications could have modified them. Although this can be a good thing, it means that Hive can never be certain if the data being read matches the schema. Therefore, Hive enforces schema on read. If the underlying data format has changed, Hive will do its best to compensate, but you will likely get unexpected results.
For example, let's say you want to do a query that's not part of the built-in SQL. Without a UDF, you would have to dump a temporary table to disk, run a second tool (such as Pig or Java) for your custom query, and possibly produce a third table in HDFS that would be analyzed by Hive. By using a UDF, your custom query logic could be embedded into your SQL query, saving those steps and also leveraging the query planning and caching infrastructure of Hive. (See Hivemall for a neat example of enabling machine learning in Hive via UDFs.)
Because UDFs run as part of the Hive query and have direct access to the data, they run efficiently and eliminate intermediate steps in the pipeline. UDFs must be written in Java, so they aren't for the average SQL programmer, but they fill an important niche in the Hive toolkit. Without them, certain classes of problems would be much more difficult to solve.
Hive query performance
Hive 0.13 is the final piece in the Stinger initiative, a community effort to improve the performance of Hive, and there's no question the effort was successful. I tested each of the major speed features from the 0.11 to 0.13 releases to see how much they added to performance. The most significant feature of 0.13 is the ability to run queries on the new Tez execution framework.
In my testing, I saw query times drop by half when run on Tez, and on queries that could be cached, times dropped another 30 percent. On larger data sets, the speedup was even more dramatic. Using the ORC (Optimized Row Columnar) file format, a feature introduced in 0.11, reduced query times by about 15 percent. Vectorization, a Microsoft contribution to Hive 0.13, sped it up by about 10 percent. Enabling cost-based query optimization, another new feature in 0.13, provided a 20 percent boost.
I should point out that these tests were run on my laptop, on small data sets, using ad-hoc queries, and don't represent a real-world measurement of the new performance capabilities in Hive 0.13. What is clear is that each of these features can be a valuable tool in a performance-tuning toolkit, especially for large-scale queries.
Organizations or departments without a data warehouse can start with Hive to get a feel for the value of data analytics while keeping startup costs to a minimum. Although Hive doesn't offer a complete data warehouse solution, it does make a great, low-cost, large-scale operational data store with a fair set of analytics tools. If your analytics needs outgrow those satisfied by Hive, many traditional data warehouse vendors offer connectors and tools to bring the data into the warehouse, preserving your investments.
Until the playing field levels, companies making the best decisions -- decisions based on data and analytics -- will have a competitive advantage. Hive offers near linear scalability in query processing, an order of magnitude better price/performance ratio than traditional enterprise data warehouses, and a low barrier to entry. With 10TB enterprise data warehouse solutions costing around $1 million, managing large unstructured data sets with Hive makes a lot of sense.
Apache Hive at a glance
|Platforms||Works with Hadoop 0.20.x, 0.23.x.y, 1.x.y, 2.x.y|
|Cost||Free open source under the Apache License, Version 2.0|
This article, "Review: Apache Hive brings real-time queries to Hadoop," was originally published at InfoWorld.com. Follow the latest developments in big data and open source at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.