Thesis: Pilot

A month and a half after starting my master's thesis, I finally have the chance to write about it. And after getting permission from one of my supervisors, Gianmarco, I can publish this post, yay!

In this pilot post, I would like to give an overview of the thesis. In a nutshell, the thesis is about achieving high velocity in big data analytics by developing a distributed streaming machine learning framework. So, without further ado, here is the overview. 😀

Yes, my thesis is related to big data analysis

*the above cartoon image is taken from Space & Light’s Flickr

The Overview

Everyone is talking about big data, and one of the initial questions surrounding big data is “how to store it?”

Towards High Availability in YARN: Implementations and Experiments

This post is a follow-up to our project, High Availability in YARN. In the previous post, we explained the motivation and our proposed solution to the availability problem in YARN. Now, let’s continue with the implementations and experiments that we did as proofs of concept for our proposed solution.

Implementation

As a proof of concept of our proposed architecture, we designed and implemented an NDB storage module for the YARN resource-manager. Due to limited time, the recovery failure model was used in our implementation. In this post, we will refer to the proof-of-concept NDB-based YARN as YARN-NDB.


Towards High Availability in YARN: Motivation and Proposed Solution

Finally, it’s the end of my 3rd semester with EMDC and I would like to share our latest project: High Availability in YARN. This project is a collaboration between EMDC and the Swedish Institute of Computer Science (SICS). The project members are Arinto (me :p) and Mário. Our project partners are Umit and Strahinja (they worked on the node-manager of YARN). This project is supervised by Jim Dowling and mentored by Vasia Kalavri.

This post explains the motivation behind the project and our proposed solution. The follow-up post explains the implementations and experiments as proofs of concept of our solutions.

Problem statement

YARN solves the scalability issues of the previous MapReduce framework. It also offers flexibility in executing computation frameworks on top of a cluster where YARN is deployed [1]. However, it still has one limitation: its availability.


  1. Apache Hadoop YARN Background and Overview