Finally, after more than a year of hiatus, I am writing on this blog again! This time, I’ll write about Effective Elasticsearch Mapping, i.e. some tips on how to define our mappings in Elasticsearch 1.7.x. We just started using Elasticsearch 2.0 at LARC (my current workplace :)), so I’ll do my best to update this list accordingly as we gain experience with Elasticsearch 2.0.
1. Use appropriate data type for an ES field
Elasticsearch will try its best to determine the data type of an unknown field when we index a document. However, it’s better to choose the appropriate data type for each ES field yourself, so define your mapping early (i.e. before you start to index the documents) and use an index template.
Some of the data types that you should use:
date type for timestamp.
Don’t use the long or string type, although they may look promising. Using the date type allows us to use the index easily in Kibana and supports time-based aggregation without the need for scanning.
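To make the tip above concrete, here is a minimal sketch of an index template body that maps a timestamp field to the date type. The index pattern `events-*`, the document type `event`, and the field name `timestamp` are hypothetical names picked for illustration, and the body layout assumes the Elasticsearch 1.x template format (a `template` pattern plus per-type `mappings`).

```python
import json


def build_index_template(index_pattern, doc_type, timestamp_field):
    """Build an index template body that maps one field to the date type.

    All three arguments are illustrative; use the names from your own data.
    """
    return {
        # Indices whose names match this pattern pick up the mapping below.
        "template": index_pattern,
        "mappings": {
            doc_type: {
                "properties": {
                    # Explicit date type instead of letting ES guess long/string.
                    timestamp_field: {"type": "date"},
                }
            }
        },
    }


template_body = build_index_template("events-*", "event", "timestamp")
print(json.dumps(template_body, indent=2))
```

The printed JSON can then be registered as an index template (in Elasticsearch 1.x, with a PUT request to `/_template/<template name>`) before any document is indexed.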
I had the chance to use Twitter Storm for my thesis, and in this post I would like to give some pointers about it. I hope this will be useful for those who are starting to use Storm in their projects 🙂
Well, I tried to search for the Twitter Storm logo, but I could not find it. Then suddenly I remembered the movie pictured above. Okay, let’s get back to business.
What is Twitter Storm?
Twitter Storm is a distributed streaming computation framework. It does for real-time processing (via streaming) what Hadoop’s MapReduce (MR) does for batch processing. The main reason it exists is the inflexibility of Hadoop MR in handling stream processing, i.e. it’s too complex and error-prone to configure Hadoop MR to handle streaming data (for more detail, watch the first five minutes of this video).
In this post, I plan to write a quick recap of related work in Distributed Streaming Classification, focusing on decision tree induction. It is still related to my thesis on a Distributed Streaming Machine Learning Framework. I divide this post into four sections: Classification, Distributed Classification, Streaming Classification, and Distributed Streaming Classification. Without further ado, let’s start with Classification.
Classification is a type of machine learning task which infers a function from labeled training data. This function is used to predict the label (or class) of testing data. Classification is also called supervised learning, since we use the actual class output (the ground truth) to supervise the output of our classification algorithm. Many classification algorithms have been developed, such as tree-based algorithms (C4.5 decision tree, bagging and boosting decision tree, decision stump, boosted stump, random forest etc.), neural networks, Support Vector Machines (SVMs), rule-based algorithms (conjunctive rule, RIPPER, PART, PRISM etc.), naive Bayes, logistic regression and many more.
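As a small illustration of "inferring a function from labeled training data", here is a sketch of a decision stump, one of the tree-based classifiers listed above. It is a toy written for this recap (not taken from any library), and the training data and the labels "low" and "high" are made up.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must put data on both sides
            left_label = majority_label(left)
            right_label = majority_label(right)
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    if best is None:
        raise ValueError("no valid split found in the training data")
    return best[1:]


def predict(stump, row):
    """Apply the learned function to an unlabeled row."""
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


# Labeled training data (the ground truth supervises the learning).
train_rows = [[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]]
train_labels = ["low", "low", "low", "high", "high", "high"]

stump = train_stump(train_rows, train_labels)
predictions = [predict(stump, row) for row in [[2.5], [9.5]]]
print(predictions)
```

The stump is the "function" here: it is learned once from the labeled rows and then reused to label unseen test rows.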
This post is a follow-up to the post about our project, High Availability in YARN. In the previous post, we explained the motivation and our proposed solution to the availability problem in YARN. Now, let’s continue with the implementations and experiments that we have done as proofs of concept for our proposed solution.
As a proof of concept of our proposed architecture, we designed and implemented an NDB storage module for the YARN resource-manager. Due to limited time, the recovery failure model was used in our implementation. In this post, we will refer to the proof of concept of NDB-based YARN as YARN-NDB.
Finally, it’s the end of my 3rd semester with EMDC and I would like to share our latest project: High Availability in YARN. This project is a collaboration between EMDC and the Swedish Institute of Computer Science (SICS). The project members are Arinto (me :p) and Mário. Our project partners are Umit and Strahinja (they worked on the node-manager of YARN). The project is supervised by Jim Dowling and mentored by Vasia Kalavri.
This post explains the motivation behind the project and our proposed solution. The follow-up post explains the implementations and experiments as proofs of concept of our solutions.
YARN solves the scalability issues of the previous MapReduce framework. It also offers flexibility in executing computation frameworks on top of the cluster where YARN is deployed. However, it still has one limitation: its availability.
Arggghh.. I broke my promise!! I should have finished this post earlier.. :(. huffff.. I was busy with school assignments and activities with Indonesian societies in Stockholm hehe.. maybe I should write on it as well humm… okay, now back to business 🙂
In the previous post, I wrote about several consistency types from Doug Terry’s breakfast talk at my school. Now, it’s time to see their application in a simple baseball game.
Simple Baseball Game
The baseball game itself consists of several “entities” that are “interested” in the latest score of the game. The “entities” are represented as pseudocode, and the term “interested” can be interpreted as read or write depending on the entity type. We will discuss what kind of consistency is needed for each entity below.
Last week I attended an interesting breakfast talk by Doug Terry, principal researcher from Microsoft Research Silicon Valley, and I think it was a pretty cool talk B-). The talk itself is about other consistency types which lie between Strong and Eventual Consistency, and how these additional consistency types are (practically) used, shown through simple pseudocode of a baseball game. This is the first of two planned posts on this topic.
Let’s start with the two most basic consistency models (which we usually learn in our first Distributed Systems course/module at university): Strong Consistency and Eventual Consistency.
In Strong Consistency, every node in a distributed system always sees the same, latest view after a specific node updates the data. This consistency type sacrifices performance and availability in order to ensure a consistent view across the nodes in the distributed system, since maintaining the consistency is expensive in terms of network and computing resources. An example of a system that uses Strong Consistency is Windows Azure. Strong Consistency is desirable within a data center; however, performance and availability concerns start to arise in geo-replicated services that span multiple data centers.
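To make the difference between the two basic models more tangible, here is a toy in-memory sketch. The `ReplicatedStore` class and its method names are hypothetical and written for this post; it does not model any real system, only the idea that a strong write reaches every replica before returning while an eventual write reaches the other replicas later.

```python
class ReplicatedStore:
    """Toy model of a key-value store copied across several replicas."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # updates not yet applied to the secondary replicas

    def propagate(self):
        """Apply queued updates to the secondaries, in write order."""
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()

    def write_strong(self, key, value):
        """Update every replica before returning (costly, but always consistent)."""
        self.propagate()  # flush older updates first so they cannot overwrite this one
        for replica in self.replicas:
            replica[key] = value

    def write_eventual(self, key, value):
        """Update only the first replica now; the others catch up later."""
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, replica_index, key):
        return self.replicas[replica_index].get(key)


store = ReplicatedStore(replica_count=3)

store.write_strong("score", 1)
strong_views = [store.read(i, "score") for i in range(3)]

store.write_eventual("score", 2)
stale_views = [store.read(i, "score") for i in range(3)]

store.propagate()
converged_views = [store.read(i, "score") for i in range(3)]

print(strong_views, stale_views, converged_views)
```

In the strong case every replica answers with the same value right away. In the eventual case a reader may see the old value until propagation happens, and that gap is where the consistency types between the two models come in.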
This is the follow-up to the project in this post. Our professor asked us to perform a scalability analysis of the technology that we used, and to write a simple report on it. I decided to analyze Flume scalability in terms of the number of events that can be supported by the Flume configuration. The project itself is inspired by Mike Percy’s Flume-NG performance measurement, and I re-used some of his software components. There are two main differences between this project and Mike’s work:
This experiment introduces a one-to-one relationship between the nodes and the Flume load generators. That means each Flume load generator process runs on an independent node (which is an Amazon EC2 medium instance).
This experiment introduces a cascading setup, which will verify whether or not there is an improvement in scalability compared to the non-cascading setup.
This is the follow-up to this post about the EEDC project this semester. Well, I would have liked more time to polish my slides, but that’s fine. It’s already past 🙂
I used a historical perspective to discuss the “rise” of network virtualization. In this context I focused on SDN (Software-Defined Networking), and it all started in 2007. At that time, networks were faster, but not better. They had several limitations, such as high complexity in maintenance, a high possibility of policy inconsistency across devices in the network, inability to scale, and dependency on vendors. On the other hand, the need for a new network architecture was crucial due to changing traffic patterns (not only from client to server or vice versa, but also between nodes in a server cluster), the consumerization of IT, and the rise of cloud services and Big Data.
I’m planning to cover the Network Virtualization topic for my next project in the EEDC course. The main inspiration comes from this article from Technology Review, about Nicira. And here is my plan for the task, which I presented to our professor.
Network Virtualization is the next trend in virtualization after OS virtualization. It allows users to easily reconfigure a network in a cloud computing environment, as well as to increase the security of the network.