Review: Apache Hive brings real-time queries to Hadoop

Hive's SQL-like query language and vastly improved speed on huge data sets make it the perfect partner for an enterprise data warehouse

Page 3 of 5

Working with Hive
As noted above, Hive speaks a simple, SQL-like language called HiveQL that supports data definition and manipulation statements. Any SQL user should find working with Hive a familiar experience. HiveQL was designed to ease the transition from SQL and to get data analysts up and running on Hadoop right away.
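A minimal HiveQL session illustrates both sides of the language; the table and column names here are purely illustrative:

```sql
-- Data definition: declare a table over delimited text in HDFS
CREATE TABLE page_views (
  view_time  TIMESTAMP,
  user_id    BIGINT,
  page_url   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Data manipulation: load a file, then query it with familiar SQL
LOAD DATA INPATH '/data/page_views.tsv' INTO TABLE page_views;

SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;
```

Anyone who has written SQL will recognize everything here except the `ROW FORMAT` clause, which tells Hive how the underlying file is laid out.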

Most BI and SQL developer tools can connect to Hive as easily as to any other database. Using the ODBC connector, users can import data and use tools like PowerPivot for Excel to explore and analyze data, making big data accessible across the organization.

There are a few significant differences between HiveQL and standard SQL. Hive 0.13 was designed to perform full-table scans across petabyte-scale data sets using the YARN and Tez infrastructure, so some features normally found in a relational database aren't available to the Hive user. These include transactions, cursors, prepared statements, row-level updates and deletes, and the ability to cancel a running query.
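In practice, the missing row-level operations are worked around by rewriting a table (or a partition) wholesale. A hedged sketch, with an illustrative table name, of how a "delete" is typically expressed:

```sql
-- Hive 0.13 has no UPDATE or DELETE; instead, overwrite the table
-- with a SELECT that keeps only the rows you want.
INSERT OVERWRITE TABLE page_views
SELECT view_time, user_id, page_url
FROM page_views
WHERE user_id IS NOT NULL;   -- effectively deletes rows lacking a user_id
```

For large tables this is usually done per partition, so only the affected slice of data is rewritten.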

The absence of these features won't significantly affect data analysis, but it may affect your ability to reuse existing SQL queries on a Hive cluster. Queries may need to be written somewhat differently than they would be for an engine that supports the full SQL language, but a seasoned user of traditional databases should have no trouble writing Hive queries. Many traditional SQL editing environments now support Hive via connectors, and Hive tables can be accessed from many SQL editors, including those from Oracle and Microsoft.

One major difference for a database user is the need to be aware of storage details. In a traditional database environment, the database engine controls all reads and writes to the database. In Hive, the database tables are stored as files in the Hadoop Distributed File System (HDFS), where other applications could have modified them. Although this can be a good thing, it means that Hive can never be certain that the data being read matches the schema. Therefore, Hive enforces schema on read: if the underlying data format has changed, Hive will do its best to compensate, but you may get unexpected results.
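External tables make this division of responsibility explicit: the schema is only an overlay on files that other applications may own. A sketch, with hypothetical names and paths:

```sql
-- Overlay a schema on files already sitting in HDFS; nothing is
-- validated at load time, only when the data is actually read.
CREATE EXTERNAL TABLE web_logs (
  log_time  STRING,
  client_ip STRING,
  status    INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/web_logs';
```

If another application later writes a row that doesn't match this layout, a `SELECT` will typically return NULL for the fields it can't parse rather than failing, which is schema on read in action.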

The Hive user must be aware of two other aspects of data storage: file formats and compression. Tuning Hive queries can involve making the underlying MapReduce jobs run more efficiently by optimizing the number, type, and size of the files backing the database tables. Hive's default storage format is text, which has the advantage of being usable by other tools. The disadvantage, however, is that queries over raw text files can't be easily optimized.
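Switching a table from text to a columnar format is a one-statement operation. A sketch, reusing the illustrative table from earlier, that converts it to ORC (one of the optimized formats Hive supports):

```sql
-- Copy an existing text-backed table into ORC storage; the new table
-- usually occupies less space and scans faster for analytic queries.
CREATE TABLE page_views_orc
STORED AS ORC
AS SELECT * FROM page_views;
```

Queries against `page_views_orc` are unchanged; only the storage clause differs.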

Hive can read and write several file formats and decompress many of them on the fly. Storage requirements and query efficiency can differ dramatically among these formats, as can be seen in the figure below (courtesy of Hortonworks): efficient file formats both reduce storage costs and speed up queries. File formats remain an active area of research in the Hadoop community.

[Figure: File formats and file sizes on HDFS. (Image courtesy of Hortonworks.)]

Processing large data sets is often a multistep process, and HiveQL includes language constructs to specify the ETL pipeline. Often, depending on the problem, a job requires storing intermediate results in temporary tables, and moving several terabytes to HDFS can be impractical. Hive provides three types of UDFs (user-defined functions) that can be used within queries for custom processing.
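The three kinds are regular UDFs (one row in, one value out), UDAFs (aggregate functions), and UDTFs (table-generating functions that emit multiple rows). A custom function written in Java is registered and invoked from HiveQL like this; the JAR path and class name below are hypothetical:

```sql
-- Register a custom Java function (hypothetical JAR and class)
ADD JAR /tmp/my-udfs.jar;
CREATE TEMPORARY FUNCTION clean_url
  AS 'com.example.hive.CleanUrlUDF';

-- Use it like any built-in function
SELECT clean_url(page_url) AS url, COUNT(*) AS views
FROM page_views
GROUP BY clean_url(page_url);
```

The `TEMPORARY` keyword means the function registration lasts only for the session, which keeps experimental code from polluting the shared metastore.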
