I already had Hadoop running on an Ubuntu 8.10 system. To add Hive, I downloaded the gzip file from hadoop.apache.org/hive, and unpacked it into a folder next to the Hadoop home folder. Next, I defined a HIVE_HOME environment variable, and executed a few HDFS commands to create specific HDFS subdirectories that Hive requires. I launched the Hive shell and was ready to go. Total time was maybe 20 minutes. (This process is described in Hive's wiki, just off the Hive main Web page.)
HQL and SQL
Although Hive's principal goal is to provide an SQL-like query mechanism for Hadoop-based data, mimicry of SQL in such an environment can -- for a variety of reasons -- go only so far. First, HDFS was built for batchlike applications that pour large quantities of data into massive files that are subsequently processed by Hadoop map/reduce tasks. It is a write-once, read-often-and-sequentially file system. HDFS does not currently support random write operations and likely never will. Hence, HQL's closest approach to an SQL INSERT INTO command is INSERT OVERWRITE, which overwrites a table's existing content with new data. For example, suppose you have already created a Hive database table called TA, and you want to add new data to it from table TB. The HQL for this is:
INSERT OVERWRITE TA SELECT * FROM
(SELECT * FROM TA UNION
SELECT * FROM TB)
The new data is added by overwriting the old table with the concatenation of its original content and the data in TB.
In addition, Hive does not store database tables in a specialized file format. Instead, it causes ordinary HDFS files to "appear" to be database files. This illusion becomes apparent when you export data into a Hive table from a file stored in a standard Linux file system. No special conversion takes place; the file is copied byte for byte into Hive from its source image in the Linux directory. This means that you have to describe the structure of the file at the time you CREATE it as a Hive table.
For example, suppose I had converted the entire Encyclopedia Britannica into a single, linear text file and processed that to produce a data file consisting of word/offset pairs. For each line in the file, the first field is the text of a given word in the encyclopedia, and the second field is the large integer offset of the word's position in the text file. (So, the line "bob 1293" indicates that "bob" was the 1,293rd word in the encyclopedia.) Assuming the file's fields are separated by tab characters and the lines by line feeds, I could create a table for this file:
CREATE TABLE WORDLOC (theWord STRING, loc BIGINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\009'
LINES TERMINATED BY '\012'
STORED AS TEXTFILE;
The structure of the file is explicitly described in the CREATE command. And when I imported the data into Hive, it would simply be copied directly, with no structural changes.
Nevertheless, Hive is impressive, particularly when you consider what is going on behind the scenes. It is converting HQL expressions into compiled-and-executed map/reduce tasks. In addition, the conversion is not a brute-force operation; Hive applies some intelligence. For example, Hive knows when conversion is unnecessary, so the simple expression "SELECT * FROM TA" will execute directly. Hive also performs "pipelining" of complex queries where possible. That is, if a query is resolved into a linear sequence of map/reduce tasks, the intermediate output of the first map/reduce job is passed on to the next job in the series, even before the first job is completed -- and so on down the line. This significantly improves throughput, as different stages in the pipeline are able to execute in parallel.
Having trouble installing and setting up Win10? You aren’t alone. Here are many of the most common...
Picking an Android phone can be difficult, but we're here to help. These are the top Android phones you...
Confidence in our power over machines also makes us guilty of hoping to bend reality to our code
From machine learning to digital twins, opportunities abound in emerging (and converging) tech trends
Slack reached a $1 billion valuation faster than any startup in history. Now it must make key decisions...
As the container orchestration system booms in popularity and acceptance, so will commercial Kubernetes...
Project Trinity would enhance Streams with data processing hardware features for greater efficiency