Hoeffding Tree for Streaming Classification

In the previous post, we have summarized C4.5 decision tree induction. Well, since my thesis is about distributed streaming machine learning, it’s time to talk about streaming decision tree induction and I think it’s better start with defining “streaming machine learning” in general.

Streaming Machine Learning

Streaming machine learning can be interpreted as performing machine learning in streaming setting. In this case, streaming setting is characterized by:

  • High data volume and rate, such as transactions logs in ATM and credit card operations, call log in telecommunication company, and social media data i.e. Twitter tweet stream or Facebook status update stream
  • Unbounded, which means these data always arrive to our system and we won’t be able to fit them in memory or disk for further analysis with the techniques. Therefore, this characteristic implies we are limited to analyse the data once and there is little chance to revisit the data

Continue reading Hoeffding Tree for Streaming Classification

C4.5 Decision Tree Implementation

It’s time to go deeper in decision tree induction. In this post, I’ll give summary on real-world implementation (i.e. the implementation has been used in actual data mining scenario) called C4.5.

C4.5

C4.5 is collection of algorithms for performing classifications in machine learning and data mining. It develops the classification model as a decision tree. C4.5 consists of three groups of algorithm: C4.5, C4.5-no-pruning and C4.5-rules. In this summary, we will focus on the basic C4.5 algorithm

Algorithm

In a nutshell, C4.5 is implemented recursively with this following sequence

  1.     Check if algorithm satisfies termination criteria
  2.     Computer information-theoretic criteria for all attributes
  3.     Choose best attribute according to the information-theoretic criteria
  4.     Create a decision node based on the best attribute in step 3
  5.     Induce (i.e. split) the dataset based on newly created decision node in step 4
  6.     For all sub-dataset in step 5, call C4.5 algorithm to get a sub-tree (recursive call)
  7.     Attach the tree obtained in step 6 to the decision node in step 4
  8.     Return tree

Continue reading C4.5 Decision Tree Implementation

Decision Tree Induction

After learning some basics about Machine Learning (ML), time to get into the details related to my thesis. After discussing with my supervisors, we decided to implement classification algorithm based on decision tree. So, in this post, I would like to give an overview about decision-tree in ML.

An example of decision tree from XKCD
An example of decision tree from XKCD 😉

What is decision-tree?

Decision-tree is the common output of a divide-and-conquer approach in learning from a set of independent instances. A decision tree consists of nodes and branches. Each node consists of questions based on one or several attributes i.e. compares an attribute value with a constant or it could compare more than one attributes using some functions. Learning data set to produce a decision tree is often called tree-induction. Continue reading Decision Tree Induction

Bootstrapping Machine Learning

My thesis will be related to machine learning(ML), therefore, I need to learn the necessary ML knowledge to do the project. In this post, I would like to revisit some concepts and materials that I used to start learning about ML. Feel free to comment and give suggestions!

Machine Learning is not statistics and not data-mining, but it is in between them. ML is more like automated application of statistics to perform data mining tasks i.e. ML develops algorithms for making predictions from data. Note that predictions in this context refers to statistical-prediction.

Not this kind of Machine Learning though :p

Data in ML consists of data instances. The data instances are represented as feature vectors. Continue reading Bootstrapping Machine Learning

Thesis: Pilot

After one and half month starting my master thesis, finally I have chance to start writing about it. And after getting the permission from one of my supervisors, Gianmarco, I can publish this post, yay!

In this pilot post, I would like to give overview of the thesis. In a nutshell, the thesis is about achieving high velocity in big data analytics, by developing distributed streaming machine learning framework. So, without further ado, here is the overview. 😀

Yes, my thesis is related to big data analysis
Yes, my thesis is related to big data analysis

*the above cartoon image is taken from Space & Light’s Flickr

The Overview

Everyone is talking about big data and one of the initial questions surrounding big data is “how to store it?”. Continue reading Thesis: Pilot

Towards High Availability in YARN: Implementations and Experiments

This post is a follow-up post about our project, High Availability in YARN. In the previous post, we have explained the motivation and our proposed solution to solve availability problem in YARN. Now, let’s continue with the implementations and experiments that we have done as proofs of concepts for our proposed solution.

Implementation

As a proof-of-concept of our proposed architecture, we designed and implemented NDB storage module for YARN resource-manager. Due to limited time, recovery failure model was used in our implementation. In this post, we will refer the proof-of-concept of NDB-based-YARN as YARN-NDB.

Continue reading Towards High Availability in YARN: Implementations and Experiments

Towards High Availability in YARN: Motivation and Proposed Solution

Finally, it’s the end of my 3rd semester with EMDC and I would like to share our latest project: High Availability in YARN. This project is collaboration between EMDC and Swedish Institute of Computer Science (SICS). The project members are Arinto (me :p) and Mário. Our project partners are Umit and Strahinja (they worked on node-manager of YARN). And this project is supervised by Jim Dowling and mentored by Vasia Kalavri.

This post explains the motivation behind the project and our proposed solution. The follow-up post explains the implementations and experiments as proofs of concept of our solutions.

Problem statement

YARN solves scalability issues of previous MapReduce framework. It also offers flexibility in executing the computation framework on top of a cluster where YARN is deployed1.  However, it still has one limitation, which is on its availability.

Continue reading Towards High Availability in YARN: Motivation and Proposed Solution

  1. Apache Hadoop YARN Background and Overview []

Dremel – Paper Review

Time is really ticking and somehow this semester I do not able to post as often as last semester.. Well, let’s start posting again.. hehe

I did paper review on Dremel (or here for ACM version) as part of ID2220 (Advanced Topics in Distributed System assignment) and here is the summary of my review. I also attached very nice slides on Dremel done by my classmate, Maria, at the end of this post.

What is Dremel?

Data analytics platform that allows interactive/ad-hoc exploration for web-scale data sets. Continue reading Dremel – Paper Review

last.fm Crawler

Two weeks ago, I, Mario & Zafar had mini project to crawl last.fm’s social graph. We performed Random Walk in the social graph and collected the user data such as age, playcounts and number of playlists. Using the collected data, we estimated the property of last.fm user using simple average and normalized average by the number of friends that user has (node degree).  The detail of the project can be found in Mario’s post, and I attached the project slides for easy reference:

 

 

The Curious Case of Consistency – Part 2

Arggghh.. I broke my promise!! I should have finished this post earlier.. :(. huffff..  I was busy with school assignments and activities with Indonesian societies in Stockholm hehe.. maybe I should write on it as well humm… okay, now back to business 🙂

In the previous post, I wrote about several consistency types from Doug Terry‘s breakfast talk in my school. Now, it’s time to see their application in simple baseball game.

Simple Baseball Game

The baseball game itself will consist of several “entities” that are “interested” in the latest score of the game. The “entities” are represented as pseudocode, and the term “interested” can be interpreted as read or write depending on entity type. We will discuss what kind of consistency that is needed for each entity below

Continue reading The Curious Case of Consistency – Part 2