The Fifth Elephant 2014

A conference on big data and analytics

De-dup on Hadoop

Submitted by Neeta Pande (@neetapande) on Thursday, 12 June 2014

Technical level

Beginner

Section

Crisp talk

Status

Confirmed & Scheduled

Total votes:  +4

Objective

In this talk, I will share our experience at Intuit of building a Master Data Management (MDM) solution on the Hadoop platform. At its core, an MDM solution consists of fuzzy matching, entity resolution and de-duplication. Solving these patterns on a big data platform like Hadoop is the focus of this discussion.

Description

In many enterprises, business data includes many client, customer, vendor or product lists held in different formats and systems, and many of the entries are near duplicates. MDM solutions on RDBMS have been prominent for many years in almost every enterprise. They support master data management by removing duplicates, standardizing data and applying rules that keep incorrect data out of the system, in order to create an authoritative source of master data. MDM on big data platforms like Hadoop has benefits as well as its own set of challenges when compared with its RDBMS counterparts. I will cover them in detail, focusing primarily on building this solution on Hadoop.
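
As a rough illustration of the fuzzy matching and de-duplication pattern described above, here is a minimal single-machine sketch in Python. It is not the implementation built at Intuit. The function names, the three-character blocking key and the 0.85 similarity threshold are all assumptions made for this example. Grouping by a blocking key stands in for a map step, and comparing records inside each group stands in for a reduce step.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def blocking_key(name):
    """Hypothetical blocking key: first three characters of the normalized name."""
    return normalize(name)[:3]


def similar(a, b, threshold=0.85):
    """Fuzzy match on normalized names using difflib's similarity ratio."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def dedupe(records, threshold=0.85):
    """Return one representative per cluster of near-duplicate records."""
    # "Map"-like step: group records by blocking key so that only
    # records sharing a key are ever compared with each other.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    # "Reduce"-like step: within each block, keep a record only if it
    # does not fuzzily match a record that was already kept.
    masters = []
    for block in blocks.values():
        kept = []
        for rec in block:
            if not any(similar(rec, k, threshold) for k in kept):
                kept.append(rec)
        masters.extend(kept)
    return masters


records = ["Acme Corp.", "ACME Corp", "Acme Corporation", "Globex Inc", "Globex, Inc."]
masters = dedupe(records)
print(masters)
```

Blocking keeps the number of pairwise comparisons manageable, which matters once the lists no longer fit on one machine. The choice of key and threshold is a trade-off: in this sketch "Acme Corp." and "Acme Corporation" fall below the assumed threshold and stay separate, so a real solution would need standardization rules or a better similarity measure.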

Speaker bio

I am a Data Architect at Intuit with 13+ years of experience in BI and data analytics. Prior to Intuit, I worked at Intel, Oracle and EMC, applying BI in the manufacturing, finance and storage analytics domains.

Slides

http://www.slideshare.net/neetapande/dedup-with-hadoop
