In this post, I plan to write some quick recap of related works in Distributed Streaming Classification, focusing on decision tree induction. It is still related to my thesis in Distributed Streaming Machine Learning Framework. I divide this post into four sections: Classification, Distributed Classification, Streaming Classification, and Distributed Streaming Classification. Without further ado, let’s start with Classification
Classification is a type machine learning task which infers a function from labeled training data. This function is used to predict the label (or class) of testing data. Classification is also called as supervised learning since we use the actual class output (the ground truth) to supervise the output of our classification algorithm. Many classification algorithms have been developed such as tree-based algorithms (C4.5 decision tree, bagging and boosting decision tree, decision stump, boosted stump, random forest etc), neural-network, Support Vector Machine (SVMs), rule-based algorithms(conjunctive rule, RIPPER, PART, PRISM etc), naive bayes, logistic regression and many more.
These classification algorithms have their own advantages and disadvantages, depending on many factors such as the characteristics of analyzed data and results. For more details about classification algorithm characteristics, you can start by checking Caruana and Niculescu-Mizil’s work where ten supervised learning methods were empirically evaluated and compared. They followed up the work with further evaluation and comparison in high-dimensions data. And with regards to decision tree induction, C4.5 decision tree induction belongs to this category.
Distributed and parallel Classification
Classifications (and also machine learning in general) always assume that the training and testing data are available in the memory, hence the associated algorithms can easily perform data analysis on it (such as perform multiple passes on training data to enhance machine learning model). However, the abundance of data and the need to process larger amounts of data have triggered machine learning development. Classic classification algorithms are modified into scaled-up versions, i.e. executing the algorithm in multiple machines with message-passing or shared-memory paradigm or expanding the limited memory with temporary external storage.
Distributed and parallel machine learning frameworks are pretty common nowadays such as: GraphLab & Distributed GraphLab, Pregel and MLBase. There are also some works that try to utilize MapReduce as for distributed machine learning such as Gillick et. al. paper. With regards to decision tree induction, these are the publications that propose distributed decision tree induction:
- PLANET: massively parallel learning of tree ensembles with MapReduce (Panda et. al., paper). It used task parallelism of decision tree induction using master (controller) and worker model.
- Stochastic gradient boosted distributed decision trees (Ye et. al., paper). The authors implemented two version of distributed decision trees: using horizontal-data-partitioning on top of MapReduce and using vertical-data-partitioning on top of MPI on Hadoop.
- Characterization and parallelization of decision-tree Induction (Bradford and Fortes, paper)
- Chi-square test based decision trees induction in distributed environment(Ouyang et. al, paper.)
Another development of machine learning is in processing continuous supply of data. In conventional classification and machine learning implementations, an induced model (which is the training result), is not designed to be updated with the arrival of new data. The conventional training needs to be performed again from the beginning with the new arrived data and it is costly and time-consuming. This characteristic is often undesirable because it causes the overall machine learning and classification algorithm slows and could not handle the continuous supply of data.
However, fret not! Data stream paradigm has emerged to address the aforementioned continuous data supply challenge. And as I explained in my previous post, there are two main characteristics of data stream paradigm: high data volume & rate, and unbounded. And there are four characteristics of streaming-machine-learning-implementation: process example at a time & inspect it exactly once, use limited amount of memory, bounded processing time, able to predict at any time.
Here are some of the machine learning frameworks that aim to handle data stream paradigm are
- MOA: Massive Online Analysis. It’s a Java-based framework containing classification, clustering, and regression streaming algorithm implementation. (overview, manual)
- Vowpal Wabbit. It is a C++ based framework which uses online gradient descent as its principal learning algorithm.
- Debellor. Another Java-based machine learning framework that aims for scalable data mining through data streaming architecture. (paper)
With regards to decision tree induction, here are some publications that propose or discuss streaming decision tree and/or machine learning
- Mining high-speed data streams (Domingos and Hulten, paper). The authors proposed streaming decision tree induction based on Hoeffding bound, and implemented their proposed as VFDT (Very Fast Decision Tree) learner.
- Efficient decision tree construction on streaming data (Jin and Agrawal, paper). The paper is about another proposal on streaming decision tree induction.
- Mining time-changing data streams (Hulten et. al., paper). The authors proposed Streaming decision tree learner that reacts properly with time-changing data streams i.e. data stream that can change its characteristics so that the current learner is not suitable.
- A streaming ensemble algorithm (SEA) for large-scale classification (Street and Kim, paper).
- Mining data streams: a review (Gaber et. al., paper)
- Issues in evaluation of stream learning algorithms (Gama et. al., paper)
Distributed Streaming Classification
Nowadays the amount of data consumed by web-companies has reached petabytes scale. This type of data is commonly called Big Data. Other than magnitude, the data velocity of the data is also increasing significantly. Existing machine learning frameworks in previous section are designed to be executed in single-machine settings hence they are likely not able to reach high velocity big data processing.
To achieve this high-velocity requirement, we propose to build distributed streaming machine learning framework (which is my thesis :D).
There is not many related works in distributed or parallel streaming machine learning framework but so far I managed to obtain the following references:
- Jubatus. Scalable distributed online machine learning framework for real time big data analysis. Developed by Makino from NTT Software Innovation center. (check out the slides from XLDB2012 to get high-level-view of Jubatus)
- A streaming parallel decision tree algorithm (Ben-Haim and Tom-Tov, paper). The authors proposed approximate algorithm for streaming data to build decision tree. The advantages of their algorithm are its ability to be executed in parallel, hence it is faster compared to serial implementation without sacrificing the error-rate performance
Well, that’s all for the related works for now. I may add more during my actual thesis writing. Hohoho. For next post, I’m planning to discuss about Twitter Storm. So, stay tune!