Moving Hadoop beyond batch processing and MapReduce

Apache Tez framework opens the door to a new generation of high-performance, interactive, distributed data processing applications

Data is the new currency of the modern world. Businesses that successfully maximize its value will have a decisive impact on their own value and on their customers’ success. As the de facto platform for big data, Apache Hadoop allows businesses to create highly scalable and cost-efficient data stores. Organizations can then run massively parallel, high-performance analytical workloads on that data, unlocking insight previously hidden by technical or economic limitations. Hadoop offers data value at unprecedented scale and efficiency -- in part thanks to Apache Tez and YARN.

Analytic applications perform data processing in purpose-driven ways that are unique to specific business problems or vendor products. There are two prerequisites to creating purpose-built applications for Hadoop data access. The first is an "operating system" (somewhat akin to Windows or Linux) that can host, manage, and execute these applications in a shared Hadoop environment. Apache YARN is that data operating system for Hadoop. The second prerequisite is an application-building framework and a common standard that developers can use to write data access applications that run on YARN.

Apache Tez meets this second need. Tez is an embeddable and extensible framework that integrates easily with YARN and allows developers to write native YARN applications that span the spectrum from interactive to batch workloads. Tez builds on Hadoop's ability to process petabyte-scale datasets, allowing projects in the Apache Hadoop ecosystem to express fit-to-purpose data processing logic that yields fast response times and high throughput. Tez brings this speed and scalability to Apache projects like Hive and Pig, as well as to a growing field of third-party software applications designed for high-speed interaction with data stored in Hadoop.

Hadoop in a post-MapReduce world

Those familiar with MapReduce will wonder how Tez is different. Tez is a broader, more powerful framework that maintains MapReduce’s strengths while overcoming some of its limitations. Tez retains the following strengths from MapReduce:

  • Horizontal scalability with increasing data size and compute capacity
  • Resource elasticity to work both when capacity is abundant and when it’s limited
  • Fault tolerance and recovery from inevitable and common failures in distributed systems
  • Secure data processing using built-in Hadoop security mechanisms

But Tez is not an engine by itself. Rather, Tez provides common primitives for building applications and engines -- hence its flexibility and customizability. Developers can write MapReduce jobs using the Tez library, and Tez comes with a built-in implementation of MapReduce, which can be used to run any existing MapReduce job with Tez efficiency.

MapReduce was (and is) ideal for Hadoop users who simply want to start using Hadoop with minimal effort. Now that enterprise Hadoop is a viable, widely accepted platform, organizations are investing to extract the maximum value from data stored in their clusters. As a result, customized applications are replacing general-purpose engines such as MapReduce, bringing about greater resource utilization and improved performance.

The Tez design philosophy

Apache Tez is optimized for such customized data-processing applications running in Hadoop. It models data processing as a data flow graph, so projects in the Apache Hadoop ecosystem can meet requirements for human-interactive response times and extreme throughput at petabyte scale. Each node in the data flow graph represents a bit of business logic that transforms or analyzes data. The connections between nodes represent movement of data between different transformations.

Once the application logic has been defined via this graph, Tez parallelizes the logic and executes it in Hadoop. If a data-processing application can be modeled in this manner, it can likely be built with Tez. Extract-Transform-Load (ETL) jobs are a common form of Hadoop data processing, and any custom ETL application is a natural fit for Tez. Other good matches are query-processing engines like Apache Hive, scripting languages like Apache Pig, and language-integrated data-processing APIs like Cascading for Java and Scalding for Scala.
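To make the data flow graph model concrete, here is a minimal sketch in Python. It is not the Tez API: the DataFlowGraph class and the build_word_count helper are hypothetical names invented for illustration, and the graph runs sequentially in a single process, whereas Tez would parallelize and distribute the work across a cluster. It shows only the core idea that vertices hold logic, edges carry data, and a vertex runs once everything upstream of it has finished.

```python
from collections import Counter, defaultdict, deque


class DataFlowGraph:
    """Toy data flow graph (hypothetical, not the Tez API)."""

    def __init__(self):
        self.logic = {}                      # vertex name -> function(list of upstream results)
        self.upstream = defaultdict(list)    # vertex name -> vertices it reads from
        self.downstream = defaultdict(list)  # vertex name -> vertices that read from it

    def add_vertex(self, name, fn):
        self.logic[name] = fn

    def add_edge(self, src, dst):
        self.upstream[dst].append(src)
        self.downstream[src].append(dst)

    def run(self):
        """Run every vertex once all of its upstream vertices are done."""
        pending = {name: len(self.upstream[name]) for name in self.logic}
        ready = deque(name for name, count in pending.items() if count == 0)
        results = {}
        while ready:
            name = ready.popleft()
            inputs = [results[src] for src in self.upstream[name]]
            results[name] = self.logic[name](inputs)
            for dst in self.downstream[name]:
                pending[dst] -= 1
                if pending[dst] == 0:
                    ready.append(dst)
        if len(results) != len(self.logic):
            raise ValueError("graph contains a cycle")
        return results


def build_word_count(lines):
    """A three-vertex ETL flow: extract words, normalize them, count them."""
    graph = DataFlowGraph()
    graph.add_vertex("extract", lambda _: [line.split() for line in lines])
    graph.add_vertex("transform",
                     lambda ins: [w.lower() for words in ins[0] for w in words])
    graph.add_vertex("load", lambda ins: dict(Counter(ins[0])))
    graph.add_edge("extract", "transform")
    graph.add_edge("transform", "load")
    return graph
```

Calling build_word_count(["Tez runs", "tez scales"]).run() returns one result per vertex, with the word counts under the "load" key. A real engine would also decide how many parallel tasks each vertex gets and how data is partitioned along each edge, which this sketch leaves out.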

When used in conjunction with other Apache projects, Tez allows for more expressive processing tasks. Apache Hive with Tez brings high-performance SQL execution to Hadoop. Apache Pig with Tez is optimized for large-scale, complex ETL in Hadoop. Cascading and Scalding can use Tez to run the most efficient translations of Java and Scala code.

Tez includes Java APIs that let developers build custom data-processing graphs that represent their applications’ data flows as efficiently as possible. After a flow has been defined, Tez provides additional APIs to inject custom business logic that will run in that flow. These APIs combine Inputs (that read data), Outputs (that write data), and Processors (that process data) in a modular environment. Think of these as build-your-own Lego blocks for data analysis.
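The Input, Processor, and Output split can be sketched in a few lines of Python. This is a conceptual model only: the class and method names below (Input.read, Processor.process, Output.write, run_task) are assumptions chosen for illustration and do not reproduce the signatures of the Tez Java APIs.

```python
from abc import ABC, abstractmethod


class Input(ABC):
    """Reads records from some source (hypothetical interface)."""

    @abstractmethod
    def read(self):
        """Return an iterable of records."""


class Output(ABC):
    """Writes records to some sink (hypothetical interface)."""

    @abstractmethod
    def write(self, record):
        """Store a single record."""


class Processor(ABC):
    """Holds the business logic that sits between an Input and an Output."""

    @abstractmethod
    def process(self, records):
        """Return an iterable of transformed records."""


class ListInput(Input):
    def __init__(self, records):
        self.records = list(records)

    def read(self):
        return iter(self.records)


class ListOutput(Output):
    def __init__(self):
        self.records = []

    def write(self, record):
        self.records.append(record)


class FilterProcessor(Processor):
    def __init__(self, predicate):
        self.predicate = predicate

    def process(self, records):
        return (r for r in records if self.predicate(r))


def run_task(task_input, processor, task_output):
    """Wire the three blocks together for one task."""
    for record in processor.process(task_input.read()):
        task_output.write(record)
```

For example, running run_task(ListInput([1, 2, 3, 4]), FilterProcessor(lambda n: n % 2 == 0), out) with out = ListOutput() leaves the even numbers in out.records. Because run_task depends only on the three interfaces, any block can be swapped for another implementation without touching the other two.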

Applications built with these APIs can run efficiently in Hadoop while the Tez framework handles the complexities of interacting with the other stack components. The result is a custom-optimized, natively integrated YARN application that’s efficient, scalable, fault-tolerant, and secure in multitenant Hadoop environments.

Applying Tez

Thus, businesses can use Tez to create purpose-built analytics applications in Hadoop. When doing so, they can draw on two types of application customizations in Tez: They can define the data flow, and they can customize the business logic.

The first step is to define the data flow that solves the problem. Multiple data flow graphs can solve the same problem, but choosing the right one has a large impact on the application’s performance. For example, Apache Hive’s performance is vastly improved by being able to define optimal joining graphs using Tez APIs.
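As a rough illustration of how two graph shapes can solve the same problem, the Python sketch below joins two keyed datasets in two ways. Both are generic join strategies written for this example, not Hive's actual plans, and shuffle_join and broadcast_join are hypothetical function names. The first repartitions both inputs by key before joining; the second copies the small input to wherever the large one is read, which avoids moving the large input at all.

```python
from collections import defaultdict


def shuffle_join(left, right, num_partitions=4):
    """Graph shape 1: repartition both inputs by key, then join per partition."""
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    for key, value in left:
        left_parts[hash(key) % num_partitions].append((key, value))
    for key, value in right:
        right_parts[hash(key) % num_partitions].append((key, value))

    joined = []
    for p in range(num_partitions):
        # Each partition could be joined by an independent task.
        lookup = defaultdict(list)
        for key, value in right_parts[p]:
            lookup[key].append(value)
        for key, lvalue in left_parts[p]:
            for rvalue in lookup.get(key, []):
                joined.append((key, lvalue, rvalue))
    return joined


def broadcast_join(large, small):
    """Graph shape 2: build one lookup table from the small input and
    stream the large input past it, with no repartitioning of the large side."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    return [(key, lvalue, svalue)
            for key, lvalue in large
            for svalue in lookup.get(key, [])]
```

Both functions return the same set of joined rows, possibly in a different order. Which shape is cheaper depends on the data: the broadcast form only makes sense when one side is small enough to hold in memory on every task.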

Then, for the same data flow, businesses can customize the business logic using the Inputs, Outputs, and Processors that execute the task. 

Note that in the same way businesses can customize their data processing applications, ISVs and other vendors can draw on Tez to showcase their unique value propositions. For example, a storage provider can swap inputs and outputs with custom implementations for its storage service. If a vendor has advanced hardware -- say, with RDMA or InfiniBand -- then it is easy to plug in an optimized implementation.

The big data landscape is exploding with possibilities, with large volumes of new types of data captured, stored, and processed by Apache Hadoop. Because it reduces the cost, complexity, and risk of managing big data, Hadoop has taken its rightful place in the modern data architecture -- as a mainstream component in the enterprise data warehouse.

Apache Tez makes Hadoop even more applicable, with opportunities to solve existing use cases and discover new ones with purpose-built applications. Tez unlocks the potential of big data by enabling the next generation of high-performance, interactive applications in Hadoop, without requiring the elimination of any process or application that already works well. 

Bikas Saha (@bikassaha) has been working in the Apache Hadoop ecosystem since 2011. He is a committer/PMC member of the Apache Hadoop and Tez projects, where he has been a key contributor in making Hadoop run natively on Windows and has focused on YARN and the Hadoop compute stack. Prior to Hadoop, he worked extensively on the Dryad distributed data processing framework that runs on some of the world's largest clusters as part of Microsoft Bing infrastructure.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.
