Towards High Availability in YARN: Motivation and Proposed Solution

Finally, it's the end of my 3rd semester with EMDC and I would like to share our latest project: High Availability in YARN. This project is a collaboration between EMDC and the Swedish Institute of Computer Science (SICS). The project members are Arinto (me :p) and Mário. Our project partners are Umit and Strahinja (they worked on the node-manager of YARN). The project is supervised by Jim Dowling and mentored by Vasia Kalavri.

This post explains the motivation behind the project and our proposed solution. The follow-up post explains the implementation and the experiments we performed as a proof of concept of our solution.

Problem statement

YARN solves the scalability issues of the previous MapReduce framework. It also offers the flexibility to run different computation frameworks on top of a cluster where YARN is deployed [1]. However, it still has one important limitation: its availability.

Continue reading Towards High Availability in YARN: Motivation and Proposed Solution

  1. Apache Hadoop YARN Background and Overview

Dremel – Paper Review

Time really flies, and somehow this semester I have not been able to post as often as last semester.. Well, let's start posting again.. hehe

I did a paper review on Dremel (or here for the ACM version) as part of an ID2220 (Advanced Topics in Distributed Systems) assignment, and here is a summary of my review. I also attached a very nice set of slides on Dremel, made by my classmate Maria, at the end of this post.

What is Dremel?

A data analytics platform that allows interactive/ad-hoc exploration of web-scale data sets. Continue reading Dremel – Paper Review

last.fm Crawler

Two weeks ago, Mario, Zafar, and I had a mini project to crawl last.fm's social graph. We performed a Random Walk over the social graph and collected user data such as age, playcounts, and number of playlists. Using the collected data, we estimated properties of last.fm users with both a simple average and an average normalized by the number of friends each user has (the node degree); a small sketch of the two estimators follows below. The details of the project can be found in Mario's post, and I attached the project slides for easy reference.
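For reference, here is a minimal Python sketch (my own illustration, not the project code) of the two estimators we compared. The `friends_of` and `profile_of` callables stand for a hypothetical crawler interface; the re-weighting by 1/degree is the standard way to correct for a random walk's bias towards high-degree users.

```python
import random

def random_walk_estimates(seed_user, steps, friends_of, profile_of):
    """Walk the social graph and estimate the average playcount two ways.

    friends_of(user) -> list of friend usernames (hypothetical crawler call)
    profile_of(user) -> dict with e.g. 'playcount' (hypothetical crawler call)
    """
    samples = []                                 # (playcount, degree) per visited user
    user = seed_user
    for _ in range(steps):
        friends = friends_of(user)
        if not friends:
            break
        samples.append((profile_of(user)["playcount"], len(friends)))
        user = random.choice(friends)            # one random-walk step

    simple_avg = sum(x for x, _ in samples) / len(samples)
    # Re-weight by 1/degree: a random walk visits users roughly in proportion
    # to their degree, so down-weighting high-degree users reduces that bias.
    norm_avg = (sum(x / d for x, d in samples) /
                sum(1 / d for _, d in samples))
    return simple_avg, norm_avg
```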


The Curious Case of Consistency – Part 2

Arggghh.. I broke my promise!! I should have finished this post earlier.. :(. huffff..  I was busy with school assignments and activities with the Indonesian societies in Stockholm hehe.. maybe I should write about that as well humm… okay, now back to business 🙂

In the previous post, I wrote about several consistency types from Doug Terry's breakfast talk at my school. Now it's time to see how they apply to a simple baseball game.

Simple Baseball Game

The baseball game consists of several "entities" that are "interested" in the latest score of the game. The entities are represented as pseudocode, and "interested" can be interpreted as a read or a write depending on the entity type. We will discuss what kind of consistency each entity needs below.
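To give a flavor of what that pseudocode looks like, here is a rough Python sketch of two such entities (my own paraphrase against a hypothetical `read`/`write` key-value store, not the talk's exact code); the full set of entities and the consistency each one needs is discussed in the rest of the post.

```python
def visitors_score_run(store):
    """Official scorekeeper: records a run for the visiting team.
    The read must reflect his own earlier writes (read-my-writes),
    otherwise runs would silently be lost."""
    score = store.read("visitors")
    store.write("visitors", score + 1)

def radio_reporter(store):
    """Radio reporter: only reads the score to announce it, so a slightly
    stale but never-going-backwards view (monotonic reads) is good enough."""
    return store.read("visitors"), store.read("home")
```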

Continue reading The Curious Case of Consistency – Part 2

The Curious Case of Consistency – Part 1

Last week I attended an interesting breakfast talk by Doug Terry, a principal researcher at Microsoft Research Silicon Valley, and I think it was a pretty cool talk B-). The talk was about the consistency types that lie between Strong and Eventual Consistency, and how these additional consistency types are (practically) used, illustrated through simple pseudo-code for a baseball game. This is the first of two planned posts on this topic.

Let's start with the two most basic consistency models (which we usually learn in our first Distributed Systems course/module at university): Strong Consistency and Eventual Consistency.

With Strong Consistency, every node in the distributed system always sees the same, latest view after a node updates the data. This consistency type sacrifices performance and availability in order to ensure a consistent view across the nodes, since maintaining that consistency is expensive in terms of the network and computing resources required. An example of a system that uses Strong Consistency is Windows Azure. Strong Consistency is desirable within a data center; however, performance and availability concerns start to arise in geo-replication services that span multiple data centers.
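As a toy illustration (my own, not from the talk), here is a minimal Python sketch of the difference: a strongly consistent write updates every replica before it returns, so a read on any node sees the latest value, while an eventually consistent write acknowledges after updating one replica and propagates to the rest in the background.

```python
class Replicas:
    """Toy replicated key-value store with several copies of the data."""
    def __init__(self, n=3):
        self.nodes = [{} for _ in range(n)]
        self.pending = []                      # updates not yet propagated

    def write_strong(self, key, value):
        for node in self.nodes:                # update every replica before
            node[key] = value                  # the write is acknowledged

    def write_eventual(self, key, value):
        self.nodes[0][key] = value             # acknowledge after one replica
        self.pending.append((key, value))      # ...and propagate later

    def propagate(self):
        for key, value in self.pending:
            for node in self.nodes:
                node[key] = value
        self.pending.clear()

    def read(self, node_id, key):
        return self.nodes[node_id].get(key)

r = Replicas()
r.write_eventual("score", "2-5")
print(r.read(2, "score"))    # may print None: replica 2 is still stale
r.propagate()
r.write_strong("score", "3-5")
print(r.read(2, "score"))    # always prints the latest value, "3-5"
```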

I use strong consistency!

Continue reading The Curious Case of Consistency – Part 1

Flume Event Scalability Analysis

This is the follow-up to the project in this post. Our professor asked us to perform a scalability analysis of the technology we used and write a simple report on it. I decided to analyze Flume's scalability in terms of the number of events that a given Flume configuration can support. The project was inspired by Mike Percy's Flume-NG performance measurement, and I re-used some of his software components. There are two main differences between this project and Mike's work:

  • This experiment introduces a one-to-one relationship between the nodes and the Flume load generators. That means each Flume load generator process runs on an independent node (an Amazon EC2 medium instance).
  • This experiment introduces a cascading setup, which verifies whether or not there is an improvement in scalability compared to a non-cascading setup.

Weekend with Flume – Part 2

After covering some basic Flume configurations in Part 1 of the Weekend with Flume series, I'll cover the Avro Source, Avro Sink, and HDFS Sink in this post. Let's pick a scenario from our school project, shown below. We set up this configuration on Amazon Web Services (AWS), but I will not discuss our experiences with AWS in this post.

Flume-based News Aggregator

Avro Sink Configuration

We have Agent#1 connecting to the Collector through an Avro Sink/Avro Source pair. To achieve this configuration, we have the following Flume configuration file on Agent#1. Continue reading Weekend with Flume – Part 2
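The actual file is in the full post; as a rough orientation, here is only a minimal sketch of what a Flume NG agent configuration of this shape can look like (the agent/component names, source command, hostname, and port are placeholders I made up, not our real setup):

```
# Hypothetical names/host/port -- a minimal Avro-sink agent sketch, not our actual file.
agent1.sources  = newsSource
agent1.channels = memChannel
agent1.sinks    = avroSink

# Source: tail a (made-up) log file of incoming news items
agent1.sources.newsSource.type     = exec
agent1.sources.newsSource.command  = tail -F /var/log/news/feed.log
agent1.sources.newsSource.channels = memChannel

# Channel: buffer events in memory between source and sink
agent1.channels.memChannel.type     = memory
agent1.channels.memChannel.capacity = 10000

# Avro sink pointing at the Collector's Avro source
agent1.sinks.avroSink.type     = avro
agent1.sinks.avroSink.hostname = collector.example.com
agent1.sinks.avroSink.port     = 4141
agent1.sinks.avroSink.channel  = memChannel
```

On the Collector side, a matching Avro Source would listen on the same port and feed its own channel towards the HDFS Sink, as described in the full post.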

Weekend with Flume – Part 1

I was in geeky mode this weekend, spending most of my time configuring Flume for our SDS project. I'll share some observations and tricks from our group's Flume configuration work.

Apache Flume

Flume Primer

Quoting the definition from Apache's Flume wiki:

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Continue reading Weekend with Flume – Part 1

Rise of Network Virtualization – Final

This is the follow-up post to this post about the EEDC project this semester. Well, I would have liked more time to polish my slides, but that's fine. It's already in the past 🙂

I used a historical perspective to discuss the "rise" of network virtualization; in this context I focused on SDN (Software-Defined Networking), and it all started in 2007. At that time, networks were faster, but not better. They had several limitations, such as high maintenance complexity, a high likelihood of inconsistent policies across devices in the network, inability to scale, and vendor dependency. On the other hand, the need for a new network architecture was crucial due to changing traffic patterns (not only from client to server and vice versa, but also between nodes within a server cluster), the consumerization of IT, and the rise of cloud services and Big Data. Continue reading Rise of Network Virtualization – Final

Large-Scale Decentralized Storage Systems for Volunteer Computing Systems – Final

This is my first post on the new domain :). I would like to present the final report and presentation of the Decentralized Systems (DS) project, whose overview I posted here.

We continued with the survey of Decentralized Storage Systems (DSS), and we ranked them based on the following five characteristics:

  1. Availability (AV). Formally, this is the fraction of time the system is up and able to perform normally, i.e. uptime / (uptime + downtime); for example, 99 hours of normal operation in a 100-hour window gives an availability of 0.99. In the DSS context, it is usually reflected in the system's resistance to churn, the fault-tolerance level implemented by the system, and how easily and quickly we can get the file or data we want.
  2. Scalability (SC). The ability of the system to be enlarged to accommodate growth. In the context of a DSS, this can be measured in terms of the number of nodes, the number of messages exchanged, or the amount of data stored in the storage system.
  3. Eventual Consistency (ECO). ECO is preferred over strong Consistency (C) because Availability takes priority, and pure Consistency is impractical and expensive to achieve in terms of the messages required between nodes.
  4. Performance (P). An example of a metric that can be used to measure performance is the response time to satisfy a search request. We analyze whether or not the system puts emphasis on it.
  5. Security (SE). Resistance, to some degree, against attacks or breaches such as compromised data integrity and unauthorized data access by malicious nodes. Due to the nature of volunteer computing, security is one of the important features needed to attract volunteers and ensure data validity.

We also identified the focus of each DSS that we surveyed, as shown below. Continue reading Large-Scale Decentralized Storage Systems for Volunteer Computing Systems – Final