by Simrat Hanspal (@simrathanspal) on Friday, 17 June 2016

Status: Confirmed & Scheduled
Section
Crisp talk

Technical level
Intermediate

Abstract

This talk aims to build an understanding of the relative performance of the following packages:
R data.frame
R data.table
pandas
NumPy
PySpark RDDs
PySpark DataFrames
Redshift
These packages serve different but overlapping use cases. Depending on the cardinality of the data and the operations to be performed on it, some are better suited to the task at hand than others. Before taking the plunge into ‘Big Data’, it is important to understand the point at which one is trying to kill an ant with a sledgehammer. This talk outlines our attempts at grasping this. We will not evaluate a plethora of tools, only the ones we considered for our requirements.

Outline

We will cover the design and development of the experiments and present benchmark results across select tabular operations (e.g. join, aggregation) and non-tabular operations (e.g. matrix multiplication, sort/search). For further analysis, the code will be open-sourced soon after the talk.
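
As a rough illustration of what such an experiment could look like, here is a minimal timing harness. It is only a sketch and not the code from the talk: the helper names (time_operation, make_rows, group_sum, run_benchmark) are hypothetical, and a pure-Python group-by sum stands in for the operation under test so that the example runs with the standard library alone. Under the same assumptions, the placeholder could be swapped for the equivalent pandas, NumPy or PySpark call.

```python
import random
import statistics
import time
from collections import defaultdict


def time_operation(func, data, repeats=5):
    """Return the median wall-clock time (seconds) of func(data) over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(data)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def make_rows(n_rows, n_keys, seed=0):
    """Generate (key, value) rows; n_keys sets the cardinality of the grouping column."""
    rng = random.Random(seed)
    return [(rng.randrange(n_keys), rng.random()) for _ in range(n_rows)]


def group_sum(rows):
    """Placeholder tabular operation (hypothetical): sum of values per key."""
    totals = defaultdict(float)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def run_benchmark(sizes, n_keys=100):
    """Time the placeholder operation at each data size and return {size: seconds}."""
    results = {}
    for n_rows in sizes:
        rows = make_rows(n_rows, n_keys)  # data generation is kept outside the timed region
        results[n_rows] = time_operation(group_sum, rows)
    return results


if __name__ == "__main__":
    for n_rows, seconds in run_benchmark([1_000, 10_000, 100_000]).items():
        print(f"{n_rows:>8} rows: {seconds:.6f} s")
```

Taking the median over a few repeats keeps a single slow run from skewing the result, and varying both the row count and n_keys is one way to see how data size and cardinality affect each tool.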

Speaker bio

Simrat is a Data Scientist, Engineering Ninja and Inspector Gadget at Mad Street Den. She builds data platforms and models to make sense of user and product data in e-commerce and online retail.

Comments

  • Noriega (@noriega) a year ago (edited a year ago)

    Can you post any other link than your company website? Some slides or github page or your blog maybe?

    • Simrat Hanspal (@simrathanspal) Proposer a year ago

      Have uploaded draft slides; please note these are not complete, and the GitHub URL will be released soon.
      Thanks!