Effective Elasticsearch Mapping

Finally, after a hiatus of more than a year, I am writing on this blog again! This time, I’ll write about Effective Elasticsearch Mapping, i.e. some tips on how to define our mapping in Elasticsearch 1.7.x. We just started using Elasticsearch 2.0 in LARC (my current workplace :)), so I’ll do my best to update this list as we gain experience with Elasticsearch 2.0.

1. Use appropriate data type for an ES field

Elasticsearch will try its best to determine the data type of an unknown field when we index a document. However, it’s better to choose the appropriate data type for each field yourself: define your mapping early (i.e. before you start indexing documents) and use an index template [1].

Some of the data types that you should use:

  1. date type for timestamps.
    Don’t use the long or string type, although they may look promising. Using the date type lets us use the index easily in Kibana and supports time-based aggregation without the need for scanning.
  2. geo_point type for geo locations (i.e. latitude-longitude)
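
As a minimal sketch of the two tips above, the snippet below builds the body of an index template that maps one field as date and another as geo_point. The index pattern `events-*`, the document type `event`, and the field names `timestamp` and `location` are hypothetical names chosen for illustration; the body follows the 1.x/2.x template layout (a `template` pattern plus per-type `mappings`), so adjust it for other versions.

```python
import json


def build_index_template(pattern, doc_type):
    """Return an index template body with explicit date and geo_point fields.

    The field names below are placeholders, not names from a real index.
    """
    return {
        "template": pattern,  # index name pattern this template applies to
        "mappings": {
            doc_type: {
                "properties": {
                    "timestamp": {"type": "date"},      # not long or string
                    "location": {"type": "geo_point"},  # latitude-longitude
                }
            }
        },
    }


template = build_index_template("events-*", "event")
print(json.dumps(template, indent=2))
```

The printed JSON is what you would send to the index template endpoint (PUT /_template/your-template-name) before indexing the first document, so that new indices matching the pattern pick up the mapping.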

  1. https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html

Java 8 at a glance

I’ve just attended a talk about Java 8 by Chuk Lee at the NUS Hackers meetup, and here’s the summary. If you’re impatient and want to try the code, go directly to the java-8-at-a-glance repository. 🙂

Chuk started the presentation with a short history of the JDK. Some points to note:

  • Oracle acquired Sun in 2010, while JDK 7 was still under development.
  • At that time, JDK 7 was a very ambitious project: it tried to make the JDK more modular (i.e. we can choose which JDK components our app runs with), add lambda expressions, and much more.
  • Oracle asked the Java community to vote on whether to release JDK 7 ASAP with only part of the intended features, or to wait longer for a more complete feature set.
  • The Java community chose the first option, and JDK 7 was released in 2011 (5 years after JDK 6) with decent features[1] such as NIO2, type inference for generics, and the fork/join framework.
  • However, the promised modularity and lambda expressions did not make it into JDK 7.
  • And here they are.. Java 8!

So, what are the highlights in Java 8?

  • From the language point of view: lambda expressions, interface evolution
  • From the libraries point of view: streams & bulk data operations on collections
  • From the platform point of view: the profiles option, a step towards modularity

Nope, we are not talking about Java Island[9]

Configuring Logging in Play 2.2.x

Well, actually it is pretty straightforward: just follow the Play 2.2.x documentation. But there is a caveat that cost me several hours to resolve.

Play Framework

Based on the Play 2.2.x documentation, we should pass `-Dlogger.resource` after the `start` command inside the Play console, i.e.

[OS-console] $ play
[info] Loading project definition from ....
[info] ...
..... //initialization message from Play
[play-console] $ start -Dlogger.resource=my-logging-configuration.xml

But what if we want to invoke it outside the Play console? Does the command below work properly?

play start -Dlogger.resource=my-logging-configuration.xml

Well, the answer is NO!

To check, I added some code to my controller that prints the logging configuration file in use (which is a Logback configuration file), and found that Play ignores the VM arguments (i.e. -Dlogger.resource) when started with the command above. Logback falls back to Play’s default logger.xml in `$PROJECT_HOME/play-2.2.x/repository/local/com.typesafe.play/play_2.10/2.2.x/jars/play_2.10.jar!/logger.xml`

So, to resolve this, we should invoke the “play start” command as below:

play -Dlogger.resource=my-logging-configuration.xml -Dother.vmarg=val start

That’s all for this month! See you (hopefully earlier than) next month!

Java, Eclipse, & Maven on Mac OS X Mavericks

Hola a todos!!!

It has been 4 months since my last post. I have been busy mainly with my job with LARC (and other extra commitments such as classes at Masjid) and could not allocate time to continue updating this blog :(. But from now, I’ll allocate one of my weekends every month to update this blog. 🙂 *wish me all the best!! 🙂

I’ve just received my new MacBook Pro and upgraded it to OS X 10.9.1. I found that setting up my usual Java development environment is not as straightforward as on Linux (the steps are similar, but not the same!!). Hence, here I present how I set up my Java development environment.


Java

Apple actually provides its own JDK already (it will prompt you to install it), but unfortunately it is still JDK 6. If you’re not content with JDK 6, you can easily install JDK 7 from the Oracle website by following the steps here. At the end of the installation process, check the “java” and “javac” versions; you should also have the Java control panel in “System Preferences”.

java and javac version


System Preferences with Java control panel

And don’t forget to set up the JAVA_HOME environment variable: create .bash_profile if it doesn’t exist and add this line to it [reference: article]

export JAVA_HOME=`/usr/libexec/java_home -v 1.7`


Eclipse

Well, it is almost straightforward. Download Eclipse (*.tar.gz) from its website and extract the tar.gz file into an ‘eclipse’ folder. Then copy or move the ‘eclipse’ folder into OS X’s Applications folder. Make sure you have this folder structure in your Applications folder.

Eclipse folder structure in Applications folder

Then, you can drag-and-drop the Eclipse launcher into the Dock for easy access. Now, click the Eclipse launcher and see what happens. If Eclipse runs properly, then you’re done!

But.. if Mavericks prompts you to install the Apple JRE and you want to use JRE 7, then you need to apply the workaround outlined in this discussion. (don’t forget to upvote the answer! :))


Maven

Mavericks does not include Maven by default. Hence, the easiest way is to use Homebrew to install it. Follow these instructions if you need to install an older version of Maven (Maven 2 or Maven 3.0.5).

Voila! You have JDK, Eclipse, and Maven installed on OS X Mavericks! That’s all for now, feel free to give comments and suggestions!

Outdated sbt Tutorials at Twitter’s Scala School

I am currently learning Scala through Twitter’s Scala School, and I found that the “Adding Dependencies”, “Packaging and Publishing” and “Adding Tasks” tutorials in the Simple Build Tool section are outdated. In this short tutorial, I’ll share how I worked through those outdated tutorials using sbt 0.13.0.

The logo of Scala

Adding Dependencies

Instead of creating a .scala file in project/build path, we can easily use the build.sbt file to add the required dependencies. Here is the snippet that should be added to build.sbt:

libraryDependencies ++= Seq(
  "org.scala-tools.testing" % "specs_2.10" % "1.6.+" % "test",
  "org.codehaus.jackson" % "jackson-core-asl" % "1.9.+"
)


SAMOA – Scalable Advanced Massive Online Analysis

In this post, I’ll give a quick overview of an upcoming distributed streaming machine learning framework, Scalable Advanced Massive Online Analysis (SAMOA). As I mentioned before, SAMOA is part of my and Antonio’s theses with Yahoo! Labs Barcelona.

What is SAMOA?

SAMOA is a tool for mining big data streams. It is a distributed streaming machine learning (ML) framework, i.e. it is like Mahout, but for stream mining. SAMOA contains a programming abstraction for distributed streaming ML algorithms (refer to this post for the definition of stream ML) that enables the development of new ML algorithms without dealing with the complexity of the underlying stream processing engines (SPEs, such as Twitter Storm and S4). SAMOA is also extensible, so new SPEs can be integrated into the framework. Together, these features allow SAMOA users to code a distributed streaming ML algorithm once and execute it on multiple SPEs.


Parallelism for Distributed Streaming ML

In this post, we will revisit several parallelism types that can be applied to modify conventional streaming (or online) machine learning algorithms into distributed and parallel ones. This post is a quick summary of half of chapter 4 of my thesis (which I completed one month ago! yay!).

Data Parallelism

Data Parallelism parallelizes and distributes the algorithm based on the data. There are two types of data parallelism: Vertical Parallelism and Horizontal Parallelism.

Horizontal Parallelism

Horizontal parallelism splits the data by quantity, i.e. each parallel computation receives a data subset of the same size. For example, if we have 4 components that perform the computation in parallel and 100 data items, then each component processes 25 items. As shown in the figure below, each parallel component has a local machine learning (ML) model, and every parallel component periodically pushes its updates into the global ML model.

Horizontal Parallelism

This type of parallelism is often used to provide horizontal scalability. In the online learning context, horizontal parallelism is suitable when the data arrival rate is very high. However, horizontal parallelism needs a large amount of memory, since it replicates the online machine learning model in every parallel computation element. Another caveat is the additional complexity introduced by propagating the model updates between the parallel computation elements. An example of horizontal parallelism in a distributed streaming machine learning algorithm is Ben-Haim and Yom-Tov’s work on a streaming parallel decision tree algorithm.
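
To make the idea concrete, here is a minimal sketch of horizontal parallelism, simulated sequentially in a single process. It is an illustration under my own assumptions rather than code from the thesis: the “model” is just a running mean, the data is split round-robin, and the names `MeanModel`, `horizontal_parallel_mean` and `sync_every` are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class MeanModel:
    """Toy stand-in for an ML model: a running mean kept as count and total."""
    count: int = 0
    total: float = 0.0

    def update(self, x):
        # Learn from a single data item.
        self.count += 1
        self.total += x

    def merge(self, other):
        # Fold another model's accumulated updates into this one.
        self.count += other.count
        self.total += other.total

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


def horizontal_parallel_mean(stream, n_workers=4, sync_every=10):
    """Split the stream evenly over workers; each syncs to a global model."""
    global_model = MeanModel()
    local_models = [MeanModel() for _ in range(n_workers)]
    items_per_worker = [0] * n_workers

    for i, x in enumerate(stream):
        w = i % n_workers  # round-robin split by quantity
        local_models[w].update(x)
        items_per_worker[w] += 1
        # Periodic update: push the local changes, then start a fresh buffer.
        if local_models[w].count >= sync_every:
            global_model.merge(local_models[w])
            local_models[w] = MeanModel()

    # Flush whatever is left in the local models at the end of the stream.
    for local in local_models:
        global_model.merge(local)

    return global_model, items_per_worker


global_model, items_per_worker = horizontal_parallel_mean(range(100))
print(items_per_worker)   # [25, 25, 25, 25]
print(global_model.mean)  # 49.5
```

With 4 workers and 100 items, each worker processes 25 items, matching the example above. In a real deployment the workers run on separate machines, so each local model costs memory and each periodic merge is a network message; those are the two caveats mentioned earlier.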


Bootstrapping Twitter Storm

I had the chance to use Twitter Storm for my thesis, and in this post I would like to give some pointers about it. I hope this will be useful for those who are starting to use Storm in their projects 🙂

Of course I am NOT talking about this movie 😀

Well, I tried to search for Twitter Storm logo, but I could not find it. Then suddenly I remembered about the movie pictured above. Okay, let’s get back to business.

What is Twitter Storm?

Twitter Storm is a distributed streaming computation framework. It does for real-time processing (via streaming) what Hadoop’s MapReduce (MR) does for batch processing. The main reason it exists is the inflexibility of Hadoop MR in handling stream processing, i.e. it’s too complex and error-prone to configure Hadoop MR to handle streaming data (for more detail, watch the first five minutes of this video).

Distributed Streaming Classification: Related Work

In this post, I plan to write a quick recap of related work in Distributed Streaming Classification, focusing on decision tree induction. It is still related to my thesis on a Distributed Streaming Machine Learning Framework. I divide this post into four sections: Classification, Distributed Classification, Streaming Classification, and Distributed Streaming Classification. Without further ado, let’s start with Classification.


Classification

Classification is a type of machine learning task that infers a function from labeled training data. This function is used to predict the label (or class) of testing data. Classification is a form of supervised learning, since we use the actual class output (the ground truth) to supervise the output of our classification algorithm. Many classification algorithms have been developed, such as tree-based algorithms (C4.5 decision tree, bagging and boosting decision trees, decision stump, boosted stump, random forest etc.), neural networks, Support Vector Machines (SVMs), rule-based algorithms (conjunctive rule, RIPPER, PART, PRISM etc.), naive Bayes, logistic regression and many more.


Towards High Availability in YARN: Implementations and Experiments

This post is a follow-up on our project, High Availability in YARN. In the previous post, we explained the motivation and our proposed solution to the availability problem in YARN. Now, let’s continue with the implementations and experiments that we have done as proofs of concept for our proposed solution.


As a proof of concept for our proposed architecture, we designed and implemented an NDB storage module for the YARN resource manager. Due to limited time, the recovery failure model was used in our implementation. In this post, we will refer to the proof of concept of NDB-based YARN as YARN-NDB.
