The Fifth Elephant 2014

A conference on big data and analytics

Analytics on Large Scale, Unstructured, Dynamic Data using Lambda Architecture

Submitted by Rajesh Muppalla (@codingnirvana) on Monday, 9 June 2014

Technical level

Intermediate

Section

Full talk

Status

Confirmed & Scheduled

Objective

In this talk, I will focus on our experience using Lambda Architecture at Indix to build a large-scale analytics system on unstructured, dynamically changing data sources with Hadoop, HBase, Scalding, Spark and Solr.

Description

Indix is a product intelligence platform. Our catalog has several million products and billions of price points collected from thousands of e-commerce websites, and it is constantly growing. We collect product data by crawling product pages from these websites. Our parsers extract product attributes from these pages, which are then run through a series of machine learning algorithms to classify them and extract deeper product attributes. This data gets deduped between stores, then matched across stores, and is finally fed into our analytics engine, which provides insights to our customers.

Our first attempt at building this system, around two years ago, was chaotic. We were dealing with e-commerce sites whose pages were unstructured and constantly changing. Our parsers and machine learning algorithms were also improving regularly. All this meant that we had to run our algorithms on the entire data set very often. It was not uncommon for our data refreshes to run for days, which meant high latency for product and price data. In addition, our data systems were not tolerant of human faults. We had issues where an incorrect algorithm would accidentally get deployed to production and corrupt the data we were serving. Since our data store was mutable and did not maintain a history of these changes, it was not easy to fix these corruption issues.

We soon realized that we had to rethink our data system from the ground up. We needed a simpler approach that would scale, tolerate human errors and evolve with our product.

Lambda architecture, coined by Nathan Marz, the creator of Storm and Cascalog, seemed like a step in the right direction for us.
The system has been in production for more than a year now, handles 3X more data than our older system and, most importantly, is more robust.

Lambda architecture, at its core, is a set of architecture principles that allow batch and real-time (stream) data processing to work together while building immutability, recomputation and human fault tolerance into the system.

It has three layers - batch, serving and speed.

The batch layer is responsible for computing arbitrary views on the master data. Our master data is an immutable store in HDFS, and we compute views with a series of MapReduce-style jobs written in Scalding and Spark. Our batch system recomputes these views every day over our entire data set.
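As a rough illustration of the batch layer's recomputation idea, the sketch below treats the master data as an append-only list of immutable price observations and rebuilds a view from scratch. It is not Indix's actual Scalding or Spark code; `PriceObservation`, its fields, and `compute_latest_price_view` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class PriceObservation:
    """One immutable fact: a price seen for a product at a store (hypothetical schema)."""
    product_id: str
    store: str
    price: float
    observed_at: int  # epoch seconds


def compute_latest_price_view(
    master_data: Iterable[PriceObservation],
) -> Dict[Tuple[str, str], PriceObservation]:
    """Recompute the whole view from the master data; no previous view is reused."""
    view: Dict[Tuple[str, str], PriceObservation] = {}
    for obs in master_data:
        key = (obs.product_id, obs.store)
        current = view.get(key)
        # Keep only the most recent observation per (product, store).
        if current is None or obs.observed_at > current.observed_at:
            view[key] = obs
    return view


# Master data is only ever appended to, never updated in place.
master_data = [
    PriceObservation("p1", "store-a", 10.0, 100),
    PriceObservation("p1", "store-a", 12.0, 200),
    PriceObservation("p2", "store-b", 5.0, 150),
]
batch_view = compute_latest_price_view(master_data)
```

Because the view is a pure function of the master data, a faulty algorithm can be corrected by deploying the fix and recomputing, without repairing records in place.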

The serving layer indexes and exposes precomputed views so they can be queried ad hoc with low latency. We use HBase, Solr and our own in-house, in-memory implementation for the serving layer.

The speed layer deals only with new data and compensates for the high-latency updates of the serving layer by creating real-time views. Our real-time latency requirements are on the order of a few hours rather than seconds, which allows us to use a micro-batch architecture that is a stripped-down version of our batch layer and uses the same technologies.
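A minimal sketch of the micro-batch idea, assuming each record carries a timestamp and that the last batch run covered everything up to a known cutoff. The record layout, `build_realtime_view`, and `batch_cutoff` are hypothetical names for illustration, not the production implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (product_id, price, observed_at in epoch seconds)
Record = Tuple[str, float, int]


def build_realtime_view(
    records: List[Record], batch_cutoff: int
) -> Dict[str, Tuple[float, int]]:
    """Build a view from only the records that arrived after the last batch run."""
    view: Dict[str, Tuple[float, int]] = {}
    for product_id, price, observed_at in records:
        if observed_at <= batch_cutoff:
            continue  # already reflected in the batch view
        current = view.get(product_id)
        # Keep the latest (price, timestamp) pair per product.
        if current is None or observed_at > current[1]:
            view[product_id] = (price, observed_at)
    return view


records = [
    ("p1", 10.0, 100),  # older than the cutoff, skipped
    ("p1", 11.0, 250),
    ("p2", 5.0, 300),
    ("p2", 4.5, 280),
]
realtime_view = build_realtime_view(records, batch_cutoff=200)
```

The loop body mirrors what a full batch job would do, which is the sense in which a micro-batch speed layer can share logic with the batch layer.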

To get the final result, both the batch and real-time views must be queried and their results merged.
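One simple way to picture the merge at query time, assuming both views map a product to a `(price, observed_at)` pair and that the more recent entry should win. The function name `merged_lookup` and the view shapes are assumptions made for this sketch.

```python
from typing import Dict, Optional, Tuple

# Hypothetical view shape: product_id -> (price, observed_at in epoch seconds)
View = Dict[str, Tuple[float, int]]


def merged_lookup(
    product_id: str, batch_view: View, realtime_view: View
) -> Optional[Tuple[float, int]]:
    """Query both views and return the most recent entry, or None if neither has one."""
    candidates = [
        entry
        for entry in (batch_view.get(product_id), realtime_view.get(product_id))
        if entry is not None
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry[1])


batch_view = {"p1": (10.0, 100), "p2": (5.0, 150)}
realtime_view = {"p1": (11.0, 250), "p3": (7.0, 260)}

latest_p1 = merged_lookup("p1", batch_view, realtime_view)  # present in both views
latest_p2 = merged_lookup("p2", batch_view, realtime_view)  # batch view only
latest_p3 = merged_lookup("p3", batch_view, realtime_view)  # real-time view only
```

Once the next batch run absorbs the newer records, the corresponding real-time entries can be discarded.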

Topics I will cover

  • Why Lambda Architecture? What problems did it solve for us?
  • Technical challenges encountered in building the Lambda Architecture
    • Schema Evolution
    • HDFS Small Files Issue
    • Code re-use between batch and real time systems
  • Modeling the data pipelines for each layer
  • Open problems

Speaker bio

Rajesh Muppalla is a co-founder and Director of Engineering at Indix, where he leads the data platform team that is responsible for collecting, organizing and structuring all the product-related data collected from the web.

He is passionate about big data, large scale distributed systems, continuous delivery and algorithms. He also likes mentoring and coaching developers in pursuit of building better software.

Prior to Indix, he was a technical lead on Go-CD, an agile and release management product, at Thoughtworks. The product was recently open-sourced.

He is a gold medallist in Computer Science from Pune University. In his final year of graduation, his team represented India at the Asia finals of the Microsoft Imagine Cup (then called Microsoft .NET campus challenge).

Comments

  • 1
    Govind Kanshi (@govindsk) 4 years ago

    Thanks, Rajesh, for proposing this. I am sure you will be covering the "how" part so that others can also "replicate" your stack for their own domain problems.

    • 1
      Rajesh Muppalla (@codingnirvana) Proposer 4 years ago

      Govind,

      To establish the technical challenges, I will go over our implementation at the relevant depth.

      Replicating this in your own domain should be possible, as the core principles of Lambda Architecture and the tools we have used to implement them are domain agnostic.

  • 1
    Rahul Tanwani (@tanwanirahul) 4 years ago

    Rajesh, Can you please link the slides here?

  • 1
    Rajesh Muppalla (@codingnirvana) Proposer 4 years ago (edited 4 years ago)

    Hey Rahul,

    Here are the slides - http://bit.ly/1rGxth7

    Thanks,
    Rajesh
