The Fifth Elephant 2012

Finding the elephant in the data.

How the Internet Archive preserves petabytes of data

Submitted by Anand Chitipothu (@anandology) on Wednesday, 20 June 2012

Technical level

Beginner

Section

Big Data Infrastructure & Processing

Session type

Lecture

Status

Confirmed

Total votes:  +11

Objective

Using the Internet Archive as a case study, this talk presents aspects of big data in the context of long-term preservation.

Description

The Internet Archive has been archiving the internet since 1996. It also archives and makes available a vast collection of data including films, audio and books.

The Internet Archive is one of the earliest organizations to work with petabytes of data. It built its own infrastructure to store, process and manage its data reliably, long before cloud computing. Because it is an archive, preserving data is its primary concern, and that concern shapes its engineering decisions.

This talk is an introduction to the Internet Archive and its infrastructure.

Speaker bio

This talk will be presented by Anand Chitipothu and Noufal Ibrahim. Both are employees of the Archive and work remotely from Bangalore.

Anand is a software consultant and trainer. He has been working with the Archive since 2007. He is the co-ordinator of the PyCon India 2012 conference.

Noufal is a freelance trainer and consultant based in Bangalore. He is the founder of PyCon India and organiser of the first two conferences in India.

Links

Slides

http://internetarchive.github.com/fifthelephant2012-presentation/
