Effective Elasticsearch Mapping

Finally, after more than a year of hiatus, I start writing again on this blog! This time, I’ll write about Effective Elasticsearch Mapping, i.e. some tips on how to define our mapping in Elasticsearch 1.7.x. We just started using Elasticsearch 2.0 in LARC (my current workplace :)), so I’ll do my best to update this list accordingly as we grow our experience in Elasticsearch 2.0.

1. Use appropriate data type for an ES field

Elasticsearch will try its best to determine the data type of an unknown field when we index a document. However, it’s better to use appropriate data type for ES field, hence define your mapping early (i.e. before you start to index the documents) and use index template1.

Some of the data types that you should use:

  1. date type for timestamp.
    Don’t use long or string type although they may look promising. Using date type allow us to use the index easily in Kibana and support time-aggregation without the needs of scanning.
  2. geo_point type for geo location (i.e. latitude-longitude) Continue reading Effective Elasticsearch Mapping

  1. https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html []

Bootstrapping Twitter Storm

I have chances to use Twitter Storm for my thesis and in this post I would like  to give some pointers about it. I hope this will be useful for those who are starting to use Storm in their project 🙂

Of course I am NOT talking about this movie :D
Of course I am NOT talking about this movie 😀

Well, I tried to search for Twitter Storm logo, but I could not find it. Then suddenly I remembered about the movie pictured above. Okay, let’s get back to business.

What is Twitter Storm?

Twitter Storm is a distributed streaming computation framework. It does, for real-time-processing(via streaming), what Hadoop’s MapReduce (MR) does for batch processing. The main reason why it exists is in inflexibility of Hadoop MR in handling stream processing, i.e. it’s too complex and error-prone to configure Hadoop MR in handling streaming data (for more detail, watch the first five minutes of this video). Continue reading Bootstrapping Twitter Storm

Distributed Streaming Classification: Related Work

In this post, I plan to write some quick recap of related works in Distributed Streaming Classification, focusing on decision tree induction. It is still related to my thesis in Distributed Streaming Machine Learning Framework. I divide this post into four sections: Classification, Distributed Classification, Streaming Classification, and Distributed Streaming Classification. Without further ado, let’s start with Classification

Classification

Classification is a type machine learning task which infers a function from labeled training data. This function is used to predict the label (or class) of testing data. Classification is also called as supervised learning since we use the actual class output (the ground truth) to supervise the output of our classification algorithm. Many classification algorithms have been developed such as tree-based algorithms (C4.5 decision tree, bagging and boosting decision tree, decision stump, boosted stump, random forest etc), neural-network, Support Vector Machine (SVMs), rule-based algorithms(conjunctive rule, RIPPER, PART, PRISM etc), naive bayes, logistic regression and many more.

Continue reading Distributed Streaming Classification: Related Work

Towards High Availability in YARN: Implementations and Experiments

This post is a follow-up post about our project, High Availability in YARN. In the previous post, we have explained the motivation and our proposed solution to solve availability problem in YARN. Now, let’s continue with the implementations and experiments that we have done as proofs of concepts for our proposed solution.

Implementation

As a proof-of-concept of our proposed architecture, we designed and implemented NDB storage module for YARN resource-manager. Due to limited time, recovery failure model was used in our implementation. In this post, we will refer the proof-of-concept of NDB-based-YARN as YARN-NDB.

Continue reading Towards High Availability in YARN: Implementations and Experiments

Towards High Availability in YARN: Motivation and Proposed Solution

Finally, it’s the end of my 3rd semester with EMDC and I would like to share our latest project: High Availability in YARN. This project is collaboration between EMDC and Swedish Institute of Computer Science (SICS). The project members are Arinto (me :p) and Mário. Our project partners are Umit and Strahinja (they worked on node-manager of YARN). And this project is supervised by Jim Dowling and mentored by Vasia Kalavri.

This post explains the motivation behind the project and our proposed solution. The follow-up post explains the implementations and experiments as proofs of concept of our solutions.

Problem statement

YARN solves scalability issues of previous MapReduce framework. It also offers flexibility in executing the computation framework on top of a cluster where YARN is deployed1.  However, it still has one limitation, which is on its availability.

Continue reading Towards High Availability in YARN: Motivation and Proposed Solution

  1. Apache Hadoop YARN Background and Overview []

The Curious Case of Consistency – Part 2

Arggghh.. I broke my promise!! I should have finished this post earlier.. :(. huffff..  I was busy with school assignments and activities with Indonesian societies in Stockholm hehe.. maybe I should write on it as well humm… okay, now back to business 🙂

In the previous post, I wrote about several consistency types from Doug Terry‘s breakfast talk in my school. Now, it’s time to see their application in simple baseball game.

Simple Baseball Game

The baseball game itself will consist of several “entities” that are “interested” in the latest score of the game. The “entities” are represented as pseudocode, and the term “interested” can be interpreted as read or write depending on entity type. We will discuss what kind of consistency that is needed for each entity below

Continue reading The Curious Case of Consistency – Part 2

The Curious Case of Consistency – Part 1

Last week I attended interesting breakfast talk by Doug Terry, principal researcher from Microsoft Research Silicon Valley and I think it is pretty cool talk B-). The talk itself is about other consistency types which lie between Strong and Eventual Consistency, and how the additional consistency types are (practically) used through simple pseudo-code of baseball game. And this is the first of two planned posts for this topic.

Let’s start with two most basic consistency model (which we usually learn in our first Distributed System course/module in University): Strong Consistency and Eventual Consistency.

In Strong Consistency, every node in distributed system will always see the same and latest view after a specific node update the data. This consistency type sacrifices performance and availability in order to ensure consistent view across the node in the distributed system since it is expensive in term of network and computing resources needed to maintain the consistency. Example of system that uses Strong Consistency is Windows Azure.  Strong Consistency is desirable within a data center however performance and availability concerns start raising in geo-replication service that spans in multiple-data center.

I use strong consistency!

Continue reading The Curious Case of Consistency – Part 1