Bootstrapping Twitter Storm

I have chances to use Twitter Storm for my thesis and in this post I would like ¬†to give some pointers about it. I hope this will be useful for those who are starting to use Storm in their project ūüôā

Of course I am NOT talking about this movie :D
Of course I am NOT talking about this movie ūüėÄ

Well, I tried to search for Twitter Storm logo, but I could not find it. Then suddenly I remembered about the movie pictured above. Okay, let’s get back to business.

What is Twitter Storm?

Twitter Storm is a distributed streaming computation framework. It does, for real-time-processing(via streaming), what Hadoop’s MapReduce (MR) does for batch processing. The main reason why it exists is in inflexibility of Hadoop MR in handling stream processing, i.e. it’s too complex and error-prone to configure Hadoop MR in handling streaming data (for more detail, watch the first five minutes of this video). Continue reading Bootstrapping Twitter Storm

Distributed Streaming Classification: Related Work

In this post, I plan to write some quick recap of related works in Distributed Streaming Classification, focusing on decision tree induction. It is still related to my thesis in Distributed Streaming Machine Learning Framework. I divide this post into four sections: Classification, Distributed Classification, Streaming Classification, and Distributed Streaming Classification. Without further ado, let’s start with Classification

Classification

Classification is a type machine learning task which infers a function from labeled training data. This function is used to predict the label (or class) of testing data. Classification is also called as supervised learning since we use the actual class output (the ground truth) to supervise the output of our classification algorithm. Many classification algorithms have been developed such as tree-based algorithms (C4.5 decision tree, bagging and boosting decision tree, decision stump, boosted stump, random forest etc), neural-network, Support Vector Machine (SVMs), rule-based algorithms(conjunctive rule, RIPPER, PART, PRISM etc), naive bayes, logistic regression and many more.

Continue reading Distributed Streaming Classification: Related Work

Towards High Availability in YARN: Implementations and Experiments

This post is a follow-up post about our project, High Availability in YARN. In the previous post, we have explained the motivation and our proposed solution to solve availability problem in YARN. Now, let’s continue with the implementations and experiments that we have done as proofs of concepts for our proposed solution.

Implementation

As a proof-of-concept of our proposed architecture, we designed and implemented NDB storage module for YARN resource-manager. Due to limited time, recovery failure model was used in our implementation. In this post, we will refer the proof-of-concept of NDB-based-YARN as YARN-NDB.

Continue reading Towards High Availability in YARN: Implementations and Experiments

Towards High Availability in YARN: Motivation and Proposed Solution

Finally, it’s the end of my 3rd semester with EMDC and I would like to share our latest project: High Availability in YARN. This project is collaboration between EMDC and Swedish Institute of Computer Science (SICS).¬†The project members are Arinto (me :p) and M√°rio. Our project partners are Umit and Strahinja (they worked on node-manager of YARN). And this project is supervised by Jim Dowling¬†and mentored by¬†Vasia Kalavri.

This post explains the motivation behind the project and our proposed solution. The follow-up post explains the implementations and experiments as proofs of concept of our solutions.

Problem statement

YARN solves scalability issues of previous MapReduce framework. It also offers flexibility in executing the computation framework on top of a cluster where YARN is deployed1.  However, it still has one limitation, which is on its availability.

Continue reading Towards High Availability in YARN: Motivation and Proposed Solution

  1. Apache Hadoop YARN Background and Overview []

Dremel – Paper Review

Time is really ticking and somehow this semester I do not able to post as often as last semester.. Well, let’s start posting again.. hehe

I did paper review on Dremel (or here for ACM version) as part of ID2220 (Advanced Topics in Distributed System assignment) and here is the summary of my review. I also attached very nice slides on Dremel done by my classmate, Maria, at the end of this post.

What is Dremel?

Data analytics platform that allows interactive/ad-hoc exploration for web-scale data sets. Continue reading Dremel – Paper Review

last.fm Crawler

Two weeks ago, I, Mario & Zafar had mini project to crawl last.fm’s social graph. We performed Random Walk in the social graph and collected the user data such as age, playcounts and number of playlists. Using the collected data, we estimated the property of last.fm user using simple average and normalized average by the number of friends that user has (node degree). ¬†The detail of the project can be found in Mario’s post, and I attached the project slides for easy reference:

 

 

The Curious Case of Consistency – Part 2

Arggghh.. I broke my promise!! I should have finished this post earlier.. :(. huffff.. ¬†I was busy with school assignments and activities with Indonesian societies in Stockholm hehe.. maybe I should write on it as well humm… okay, now back to business ūüôā

In the previous post, I wrote about several consistency types from Doug Terry‘s breakfast talk in my school. Now, it’s time to see their application in simple baseball game.

Simple Baseball Game

The baseball game itself will consist of several “entities” that are “interested” in the latest score of the game. The “entities” are represented as pseudocode, and the term “interested” can be interpreted as read or write depending on entity type. We will discuss what kind of consistency that is needed for each entity below

Continue reading The Curious Case of Consistency – Part 2