The Fifth Elephant 2012

Finding the elephant in the data.

Managing Data on Hadoop

Submitted by prashant singh (@prashantkr2002) on Wednesday, 6 June 2012

Technical level

Intermediate

Section

Big Data Infrastructure & Processing

Session type

Lecture

Status

Submitted

Total votes:  +5

Objective

This talk presents an approach to managing high-volume data movement on Hadoop so that the data is available for processing at Yahoo!. As part of grid data management, we load terabytes of data onto Hadoop clusters every day and replicate it to BCP clusters. We want to share our experiences, challenges and techniques for high-volume data movement on HDFS.

Description

Web applications need to mine data generated from many different logs to extract relevant information and trends, both for research and development projects and for a growing number of production processes across Yahoo!. This lecture focuses on the challenges we face in moving large volumes of data across Hadoop clusters within strict SLAs, and in prioritizing data flows according to their importance at Yahoo!.

Requirements

Knowledge of Hadoop

Speaker bio

Prashant K Singh is a Principal Engineer at Yahoo!, where he handles data management and Hadoop operations. As part of this team, he manages around 20 Hadoop clusters comprising ~40K nodes, holding 300+ PB of data, with a total cluster capacity of ~1 exabyte.

Prior to Yahoo!, Prashant worked at MakeMyTrip, where he was responsible for bringing data center activities in-house and for migrating the web portal from a Windows platform to an open source platform, making it more stable and better able to handle large amounts of user traffic.

Abhishek Dan manages the Hadoop service engineering team at Yahoo!, which is responsible for Hadoop cluster management and for data management on Hadoop clusters.
