In this post, we will revisit several parallelism types that can be applied to modify conventional streaming (or online) machine learning algorithms into distributed and parallel ones. This post is a quick summary of half of chapter 4 of my thesis (which I completed one month ago! yay!).
Data Parallelism parallelize and distribute the algorithms based on the data. There are two types of data parallelism, they are Vertical Parallelism and Horizontal Parallelism.
Horizontal parallelism splits the data based on the quantity of the data i.e. same amount of data subset goes into the parallel computation. If let’s say we have 4 components that perform parallel computation, and we have 100 data, then each component computes 25 data. As shown in figure below, each parallel component has local machine learning (ML) model. Every parallel component then performs periodical update into the global ML model.
This type of parallelism is often used to provide horizontal scalability. In online learning context, horizontal parallelism is suitable when the data arrival rate is very high. However, horizontal parallelism needs high number of memory since it needs to replicate the online machine learning model in every parallel computation element. Another caveat for horizontal parallelism is the additional complexity that introduced when propagating the model updates between parallel computation element. Example of horizontal parallelism in distributed streaming machine learning algorithm is Ben-Haim and Yom-Tov’s work about streaming parallel decision tree algorithm.