The Fifth Elephant 2014

A conference on big data and analytics

Apache Tez: Accelerating Hadoop Data Pipelines

Submitted by t3rmin4t0r (@t3rmin4t0r) on Friday, 23 May 2014

Technical level

Beginner

Section

Full talk

Status

Confirmed & Scheduled

Total votes:  +13

Objective

Apache Tez is a DAG execution engine that is a superset of traditional map-reduce. Tez is designed as a replacement computational model for nearly everything that currently uses map-reduce.

The talk is meant to be an introduction to Tez, its architecture and its evolution from traditional map-reduce.

Description

Apache Tez is a modern data processing engine designed for YARN on Hadoop 2. Tez aims to provide high performance and efficiency out of the box, across the spectrum from low-latency queries to heavyweight batch processing. With a clear separation between the logical application layer and the physical data movement layer, Tez is designed from the ground up to be a platform on which a variety of domain-specific applications can be built. Tez has pluggable control and data planes that allow users to plug in custom data transfer technologies, concurrency control and scheduling policies to meet their exact requirements. The talk will elaborate on these features via real use cases from early adopters like Hive, Pig and Cascading.
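
To make the separation between logical vertices and physical data movement concrete, here is a minimal sketch in Python. It is a toy model only and not the Tez API: the `Dag` class, the `Movement` labels and `classic_map_reduce` are hypothetical names chosen for illustration. It shows processing logic attached to vertices, a data movement type attached to each edge, and traditional map-reduce as the two-vertex special case.

```python
from dataclasses import dataclass, field
from enum import Enum


class Movement(Enum):
    """Hypothetical labels for how data travels along an edge."""
    ONE_TO_ONE = "one-to-one"
    BROADCAST = "broadcast"
    SCATTER_GATHER = "scatter-gather"


@dataclass
class Dag:
    """Toy DAG: vertices hold logic, edges hold the data movement type."""
    name: str
    vertices: dict = field(default_factory=dict)  # vertex name -> processing function
    edges: list = field(default_factory=list)     # (source, target, Movement) tuples

    def add_vertex(self, name, processor):
        self.vertices[name] = processor
        return self

    def add_edge(self, source, target, movement):
        if source not in self.vertices or target not in self.vertices:
            raise KeyError("both endpoints must be added as vertices first")
        self.edges.append((source, target, movement))
        return self

    def topological_order(self):
        """Return vertex names so that every edge points forward (Kahn's algorithm)."""
        indegree = {v: 0 for v in self.vertices}
        for _, target, _ in self.edges:
            indegree[target] += 1
        ready = [v for v, d in indegree.items() if d == 0]
        order = []
        while ready:
            current = ready.pop(0)
            order.append(current)
            for source, target, _ in self.edges:
                if source == current:
                    indegree[target] -= 1
                    if indegree[target] == 0:
                        ready.append(target)
        if len(order) != len(self.vertices):
            raise ValueError("graph has a cycle, so it is not a DAG")
        return order


def classic_map_reduce(mapper, reducer):
    """Traditional map-reduce expressed as the two-vertex special case."""
    return (Dag("map-reduce")
            .add_vertex("map", mapper)
            .add_vertex("reduce", reducer)
            .add_edge("map", "reduce", Movement.SCATTER_GATHER))
```

In this sketch, swapping the `Movement` value on an edge changes how data would be moved without touching the vertex logic, which is the kind of separation the description refers to.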

Speaker bio

Gopal works on performance problems in the Hadoop ecosystem. He's involved with the Stinger effort from Hortonworks to improve the SQL data access layers in Hadoop. He is a contributor to the Apache Hive project and a committer for the Apache Tez project.

Links

Comments

  • 1
    Viral B. Shah (@viralbshah) 4 years ago

    Looking forward to this talk. I think the talk would be more accessible with a real use-case and deployment examples.

  • 1
    t3rmin4t0r (@t3rmin4t0r) Proposer 4 years ago

    Tez is a platform tool, so it is harder to describe Tez examples - it is like asking for alphabet examples. You need to make words of some sort.

    For instance, Tez has no deployment example.

    A non-admin user on a HADOOP-2.2.x cluster can use Tez without any admin interaction.

    It is a library built on top of Hadoop YARN containers. Your cluster has YARN deployed on it, and Tez is a library component with no daemons that co-exists in the same cluster as old MR, Apache Storm, Apache Spark and Apache Accumulo (all of them as discrete containers in one multi-tenant cluster, to be precise).

    The lack of a "deployment example" is intentional. I have a slightly long write-up about it in my notes - https://github.com/t3rmin4t0r/notes/wiki/I-Like-Tez,-DevOps-Edition-(WIP)

    Tez has plenty of example use-cases, hidden under the hood of Apache Hive-13 and Apache Pig trunk.

    The advantage of Tez kicks in without rewriting Pig jobs or Hive queries, because their respective compilers compile into Tez instead of bi-partite MR. We can already run hadoop-1.x apps using Tez without recompiling (binary compatible) using a fixed 2-stage DAG overlay.

    To understand the world of difference it makes from MR, you have to look at something like TPC.org's TPC-H benchmark query-9 translated into Tez.

    http://people.apache.org/~gopalv/tpch-q9-plan.svg

    That is not very different from how multi-job map-reduce dependencies would look, but it runs without waiting for stage completion and connects reduce stages directly to other reduce stages without scheduling identity map-tasks.

    I have a very different talk that goes into a deep-dive of how you translate an SQL query into a Tez DAG for execution.

    http://www.slideshare.net/t3rmin4t0r/hivetez-a-performance-deep-dive/19

    The Pig guys have their own version of this.

    http://www.slideshare.net/ydn/pig-on-tez-hugfeb19/14

    The Cascading development team is translating their job pipeline into Tez as well.

    Hortonworks has a recent presentation which discusses the query performance improvements made because of new Tez query plans:

    http://www.slideshare.net/hortonworks/apache-hive-013-performance-benchmarks/6

  • 1
    Kiran Kumar (@kiransvkumar) 4 years ago

    Cool session. I liked the coverage of where we need to use Tez and why.

  • 1
    Raghuveer Madyastha (@krmadyastha) 4 years ago

    Thanks for a good session. Can you share your deck?
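
The proposer's reply above contrasts chained map-reduce jobs with a single DAG that connects reduce stages directly. A rough sketch of that difference in Python follows, under simplifying assumptions: the pipeline is a linear chain of one map stage followed by reduce stages, and `map_reduce_jobs` and `single_dag` are hypothetical helper names, not part of Tez or Hadoop.

```python
def map_reduce_jobs(stages):
    """Split a chain such as ["map", "reduce", "reduce"] into two-stage jobs.

    Every reduce that follows another reduce starts a new job whose map
    stage is an identity pass over the previous job's output.
    """
    if not stages or stages[0] != "map" or any(s != "reduce" for s in stages[1:]):
        raise ValueError("expected one map stage followed by reduce stages")
    jobs = []
    for index in range(len(stages) - 1):
        map_stage = "map" if index == 0 else "identity-map"
        jobs.append([map_stage, "reduce"])
    return jobs


def single_dag(stages):
    """Keep the chain as one DAG: each stage feeds the next directly."""
    return list(zip(stages, stages[1:]))


chain = ["map", "reduce", "reduce", "reduce"]
jobs = map_reduce_jobs(chain)
edges = single_dag(chain)
identity_maps = sum(job[0] == "identity-map" for job in jobs)
print(len(jobs), identity_maps, len(edges))  # prints "3 2 3"
```

For this four-stage chain the map-reduce form needs three separate jobs, two of which begin with an identity map, while the DAG form is one graph with three edges and no identity stages.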