Open source Hive: Large-scale, distributed data processing made easy
Thank heaven for Hive, a data analysis and query front end for Hadoop that makes Hadoop data files look like SQL tables
I already had Hadoop running on an Ubuntu 8.10 system. To add Hive, I downloaded the gzip file from hadoop.apache.org/hive, and unpacked it into a folder next to the Hadoop home folder. Next, I defined a HIVE_HOME environment variable, and executed a few HDFS commands to create specific HDFS subdirectories that Hive requires. I launched the Hive shell and was ready to go. Total time was maybe 20 minutes. (This process is described in Hive's wiki, just off the Hive main Web page.)
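For anyone following along, the sequence looks roughly like the sketch below. The install path is only a placeholder, and the two HDFS directories are the defaults suggested in the Hive getting-started wiki.

# Point HIVE_HOME at the unpacked Hive distribution (the path is an example)
export HIVE_HOME=/usr/local/hive
export PATH=$HIVE_HOME/bin:$PATH

# Create the scratch and warehouse directories Hive expects in HDFS,
# and make them group-writable
hadoop fs -mkdir /tmp
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse

# Start the Hive shell
hive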
HQL and SQL
Although Hive's principal goal is to provide an SQL-like query mechanism for Hadoop-based data, mimicry of SQL in such an environment can -- for a variety of reasons -- go only so far. First, HDFS was built for batchlike applications that pour large quantities of data into massive files that are subsequently processed by Hadoop map/reduce tasks. It is a write-once, read-often-and-sequentially file system. HDFS does not currently support random write operations and likely never will. Hence, HQL's closest approach to an SQL INSERT INTO command is INSERT OVERWRITE, which overwrites a table's existing content with new data. For example, suppose you have already created a Hive database table called TA, and you want to add new data to it from table TB. The HQL for this is:
INSERT OVERWRITE TABLE TA
SELECT * FROM (
  SELECT * FROM TA
  UNION ALL
  SELECT * FROM TB
) combined;
The new data is added by overwriting the old table with the concatenation of its original content and the data in TB.
In addition, Hive does not store database tables in a specialized file format. Instead, it causes ordinary HDFS files to "appear" to be database files. This illusion becomes apparent when you load data into a Hive table from a file stored in a standard Linux file system. No special conversion takes place; the file is copied byte for byte into Hive from its source in the Linux directory. This means that you have to describe the structure of the file at the time you CREATE it as a Hive table.
For example, suppose I had converted the entire Encyclopedia Britannica into a single, linear text file and processed that to produce a data file consisting of word/offset pairs. For each line in the file, the first field is the text of a given word in the encyclopedia, and the second field is a large integer recording the word's position in the text. (So, the line "bob 1293" indicates that "bob" was the 1,293rd word in the encyclopedia.) Assuming the file's fields are separated by tab characters and the lines by line feeds, I could create a table for this file:
CREATE TABLE WORDLOC (theWord STRING, loc BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
The structure of the file is explicitly described in the CREATE command. And when I imported the data into Hive, it would simply be copied directly, with no structural changes.
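To make the point concrete, pulling the word/offset file into the table takes a single HQL statement; the local path below is a hypothetical example, not a path from my setup.

-- Copy the local file, byte for byte, into Hive's warehouse directory
-- (the source path is hypothetical)
LOAD DATA LOCAL INPATH '/home/user/wordloc.txt'
OVERWRITE INTO TABLE WORDLOC;

Because the file already matches the delimiters declared in the CREATE statement, Hive has nothing to parse or convert at load time; it simply places the file under its warehouse directory.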
Nevertheless, Hive is impressive, particularly when you consider what is going on behind the scenes: It converts HQL expressions into map/reduce jobs, which it then compiles and executes. In addition, the conversion is not a brute-force operation; Hive applies some intelligence. For example, Hive knows when conversion is unnecessary, so the simple expression "SELECT * FROM TA" will execute directly. Hive also performs "pipelining" of complex queries where possible. That is, if a query resolves into a linear sequence of map/reduce tasks, the intermediate output of the first job is passed on to the next job in the series, even before the first job has finished, and so on down the line. This significantly improves throughput, as different stages in the pipeline are able to execute in parallel.
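As a rough illustration of the difference, the first query below against the WORDLOC table can be answered by reading the underlying file directly, while the second, an aggregation I've added purely for the sake of the example, forces Hive to generate a map/reduce job.

-- No map/reduce job is generated; Hive streams the file as-is
SELECT * FROM WORDLOC;

-- A map/reduce job is generated; the GROUP BY implies a reduce phase
SELECT theWord, COUNT(1) AS occurrences
FROM WORDLOC
GROUP BY theWord;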