Towards High Availability in YARN: Motivation and Proposed Solution

Finally, it’s the end of my 3rd semester with EMDC, and I would like to share our latest project: High Availability in YARN. This project is a collaboration between EMDC and the Swedish Institute of Computer Science (SICS). The project members are Arinto (me :p) and Mário. Our project partners are Umit and Strahinja (they worked on the NodeManager of YARN). The project is supervised by Jim Dowling and mentored by Vasia Kalavri.

This post explains the motivation behind the project and our proposed solution. The follow-up post covers the implementation and the experiments that serve as proofs of concept for our solution.

Problem statement

YARN solves the scalability issues of the previous MapReduce framework. It also offers flexibility in executing computation frameworks on top of a cluster where YARN is deployed [1]. However, it still has one limitation: its availability.

Flume-based Independent News Aggregator

It has been more than two weeks since my last post! 🙁 I was busy with exams, project reports and my trip preparation. Finally, I managed to find time (on my journey from Warsaw to Prague) to update my blog :p

Well, I would like to cover our SDS project titled “Flume-based Independent News Aggregator”, but my project-mate, Mario, has already covered it on his blog. So in this case, I’ll just give you the link to Mario’s post, which is here.

As a follow-up to the project, our professor asked us to experiment with the system's scalability. It is an individual project, and I plan to experiment with Flume's scalability in terms of the number of events it can support. Mario will do something related to its reliability and fault tolerance. I plan to update this blog once I have finished the project 🙂

*Update: Post about the mini project of Flume Scalability can be found here.

Weekend with Flume – Part 1

I was in geeky mode this weekend, spending most of my time configuring Flume for our SDS project. I’ll share some observations and tricks that our group used in configuring Flume.

Apache Flume

Flume Primer

Quoting the definition from the Apache Flume wiki:

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.