Distributed Streaming Classification: Related Work

In this post, I give a quick recap of related work in distributed streaming classification, focusing on decision tree induction. It relates to my thesis on a distributed streaming machine learning framework. The post is divided into four sections: Classification, Distributed Classification, Streaming Classification, and Distributed Streaming Classification. Without further ado, let’s start with Classification.


Classification is a type of machine learning task that infers a function from labeled training data. This function is then used to predict the label (or class) of testing data. Classification is a form of supervised learning, since we use the actual class output (the ground truth) to supervise the output of our classification algorithm. Many classification algorithms have been developed, such as tree-based algorithms (C4.5 decision tree, bagged and boosted decision trees, decision stump, boosted stump, random forest, etc.), neural networks, Support Vector Machines (SVMs), rule-based algorithms (conjunctive rule, RIPPER, PART, PRISM, etc.), naive Bayes, logistic regression, and many more.
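To make "infers a function from labeled training data" concrete, here is a minimal sketch of a decision stump on a single numeric attribute. The function names (`train_stump`, `majority`) and the toy data are my own assumptions for illustration, not taken from any particular library.

```python
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(xs, ys):
    """Infer a function x -> label from labeled training data.

    Tries every observed value as a threshold and keeps the one
    with the fewest training errors.
    """
    best = None  # (errors, threshold, left_label, right_label)
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not right:
            continue  # this threshold does not split the data
        l_lab, r_lab = majority(left), majority(right)
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    if best is None:  # all x values identical: fall back to the majority class
        fallback = majority(ys)
        return lambda x: fallback
    _, t, l_lab, r_lab = best
    return lambda x: l_lab if x <= t else r_lab


# Labeled training data (the ground truth supervises the learner).
train_x = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
train_y = ["low", "low", "low", "high", "high", "high"]
predict = train_stump(train_x, train_y)

# Unlabeled testing data: the inferred function predicts the class.
predictions = [predict(x) for x in [2.5, 9.5]]
```

The returned `predict` is the inferred function; the algorithms listed above differ mainly in what family of functions they search and how.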


Hoeffding Tree for Streaming Classification

In the previous post, we summarized C4.5 decision tree induction. Since my thesis is about distributed streaming machine learning, it’s time to talk about streaming decision tree induction, and I think it’s best to start by defining “streaming machine learning” in general.

Streaming Machine Learning

Streaming machine learning can be interpreted as performing machine learning in a streaming setting. Here, a streaming setting is characterized by:

  • High data volume and rate, such as transaction logs from ATM and credit card operations, call logs in a telecommunication company, and social media data, e.g. the Twitter tweet stream or the Facebook status update stream
  • Unbounded data, which means the data keep arriving at our system and we won’t be able to fit them all in memory or on disk for further analysis. This characteristic implies that we can analyse each data item only once, with little chance to revisit it
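The two characteristics above can be sketched as a one-pass learner that sees each instance once and keeps only constant-size state. This is an illustrative toy, not a real streaming framework: `transaction_stream` is a hypothetical, synthetic source, and the majority-class learner is the simplest model I could pick (it ignores the `amount` attribute entirely).

```python
from collections import Counter
from itertools import count, islice


def transaction_stream():
    """Hypothetical unbounded source: yields (amount, label) forever."""
    for i in count():
        amount = (i * 37) % 1000  # synthetic, deterministic amounts
        yield amount, ("large" if amount > 900 else "normal")


class MajorityClassLearner:
    """Keeps only per-class counts, never the instances themselves."""

    def __init__(self):
        self.counts = Counter()
        self.seen = 0

    def learn_one(self, label):
        # Each instance is inspected once and then discarded.
        self.counts[label] += 1
        self.seen += 1

    def predict(self):
        return self.counts.most_common(1)[0][0] if self.counts else None


learner = MajorityClassLearner()
# islice only bounds this demo; the stream itself never ends.
for amount, label in islice(transaction_stream(), 10_000):
    learner.learn_one(label)
```

Memory use depends on the number of classes, not on how many instances have arrived, which is the property a streaming learner needs.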


C4.5 Decision Tree Implementation

It’s time to go deeper into decision tree induction. In this post, I’ll summarize a real-world implementation (i.e. one that has been used in actual data mining scenarios) called C4.5.


C4.5 is a collection of algorithms for performing classification in machine learning and data mining. It develops the classification model as a decision tree. C4.5 consists of three groups of algorithms: C4.5, C4.5-no-pruning, and C4.5-rules. In this summary, we will focus on the basic C4.5 algorithm.


In a nutshell, C4.5 is implemented recursively with the following sequence:

  1. Check whether the termination criteria are satisfied
  2. Compute the information-theoretic criteria for all attributes
  3. Choose the best attribute according to the information-theoretic criteria
  4. Create a decision node based on the best attribute from step 3
  5. Induce (i.e. split) the dataset based on the newly created decision node from step 4
  6. For each sub-dataset from step 5, call the C4.5 algorithm to get a sub-tree (recursive call)
  7. Attach the sub-tree obtained in step 6 to the decision node from step 4
  8. Return the tree
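The eight steps above can be sketched in Python as follows. This is a simplified illustration, not the actual C4.5 implementation: it assumes categorical attributes only, uses gain ratio as the information-theoretic criterion, and leaves out pruning, continuous attributes, and missing values. The names `build_tree`, `gain_ratio`, and `classify`, and the toy dataset, are my own.

```python
import math
from collections import Counter


def entropy(values):
    total = len(values)
    return -sum(n / total * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, labels, attr):
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


def build_tree(rows, labels, attrs):
    majority = Counter(labels).most_common(1)[0][0]
    # Step 1: termination (pure node, or no attributes left).
    if len(set(labels)) == 1 or not attrs:
        return majority
    # Steps 2-3: score every attribute and pick the best one.
    scores = {a: gain_ratio(rows, labels, a) for a in attrs}
    best = max(attrs, key=scores.get)
    if scores[best] <= 0:
        return majority  # no attribute is informative
    # Step 4: create the decision node.
    node = {"attribute": best, "branches": {}}
    # Step 5: split the dataset on the chosen attribute.
    parts = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = parts.setdefault(row[best], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    # Steps 6-7: recurse on each sub-dataset and attach the sub-tree.
    remaining = [a for a in attrs if a != best]
    for value, (sub_rows, sub_labels) in parts.items():
        node["branches"][value] = build_tree(sub_rows, sub_labels, remaining)
    # Step 8: return the tree.
    return node


def classify(tree, row):
    """Follow branches until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree


# Toy data, invented for this example.
rows = [
    {"outlook": "sunny", "windy": False}, {"outlook": "sunny", "windy": True},
    {"outlook": "overcast", "windy": False}, {"outlook": "overcast", "windy": True},
    {"outlook": "rainy", "windy": False}, {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes", "yes", "no"]
tree = build_tree(rows, labels, ["outlook", "windy"])
```

On this data the root splits on `outlook`, and only the `rainy` branch needs a second split on `windy`. Note that `classify` raises a `KeyError` for attribute values that never appeared in training, which a real implementation would have to handle.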


Decision Tree Induction

After learning some basics of Machine Learning (ML), it’s time to get into the details related to my thesis. After discussing with my supervisors, we decided to implement a classification algorithm based on decision trees. So, in this post, I would like to give an overview of decision trees in ML.

An example of a decision tree from XKCD 😉

What is a decision tree?

A decision tree is the common output of a divide-and-conquer approach to learning from a set of independent instances. A decision tree consists of nodes and branches. Each node poses a question based on one or several attributes, i.e. it compares an attribute value with a constant, or it compares more than one attribute using some function. Learning from a data set to produce a decision tree is often called tree induction.
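As a rough illustration of nodes and branches, here is a hand-built tree in which every node compares one attribute value with a constant. The class names, attribute names, and thresholds are all hypothetical; in practice the tree would be produced by tree induction rather than written by hand.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # the class predicted at this leaf


@dataclass
class Node:
    attribute: str   # attribute tested at this node
    threshold: float  # constant the attribute value is compared with
    below: Union["Node", Leaf]  # branch taken when value <= threshold
    above: Union["Node", Leaf]  # branch taken when value > threshold


def classify(tree, instance):
    """Walk from the root, answering one question per node, until a leaf."""
    while isinstance(tree, Node):
        value = instance[tree.attribute]
        tree = tree.below if value <= tree.threshold else tree.above
    return tree.label


# Hypothetical tree: should we play outside?
tree = Node(
    attribute="humidity", threshold=70.0,
    below=Leaf("play"),
    above=Node(attribute="wind_speed", threshold=20.0,
               below=Leaf("play"), above=Leaf("stay in")),
)

result = classify(tree, {"humidity": 85.0, "wind_speed": 30.0})
```

Here the instance fails the humidity test, follows the `above` branch, fails the wind test as well, and ends at the "stay in" leaf.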