Finally, after a hiatus of more than a year, I’m writing on this blog again! This time, I’ll write about Effective Elasticsearch Mapping, i.e. some tips on how to define our mapping in Elasticsearch 1.7.x. We just started using Elasticsearch 2.0 at LARC (my current workplace :)), so I’ll do my best to update this list accordingly as we gain experience with Elasticsearch 2.0.
1. Use appropriate data type for an ES field
Elasticsearch will try its best to determine the data type of an unknown field when we index a document. However, it’s better to use an appropriate data type for each ES field, so define your mapping early (i.e. before you start to index the documents) and use an index template.
Some of the data types that you should use:
date type for timestamp.
Don’t use the long or string type, although they may look promising. Using the date type allows us to use the index easily in Kibana and supports time-based aggregation without the need for scanning.
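To make the tip concrete, here is a minimal sketch of an index template body that maps a timestamp field to the date type. It assumes the Elasticsearch 1.x template layout (a `template` index pattern plus per-type `mappings`). The pattern `logs-*`, the type name `event`, and the field name `timestamp` are made up for illustration, so adapt them to your own data.

```python
import json


def build_index_template(pattern, doc_type, date_field):
    """Build an ES 1.x-style index template body with one date field.

    All three arguments are illustrative placeholders, not names
    taken from a real cluster.
    """
    return {
        # Indices whose name matches this pattern pick up the mapping.
        "template": pattern,
        "mappings": {
            doc_type: {
                "properties": {
                    # Explicit date type instead of long or string.
                    date_field: {"type": "date"}
                }
            }
        },
    }


template = build_index_template("logs-*", "event", "timestamp")
print(json.dumps(template, indent=2))
```

The printed JSON is the kind of body you would send with a PUT request to the `_template` endpoint before indexing any documents, so the field never gets auto-detected as something else.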
Well, actually it is pretty straightforward, just follow this Play 2.2.x documentation. But there is a caveat that cost me several hours to resolve.
Based on the Play 2.2.x documentation, we should use `-Dlogger.resource` after the `start` command inside the Play console, i.e.:
[OS-console] $ play
[info] Loading project definition from ....
..... //initialization message from Play
[play-console] $ start -Dlogger.resource=my-logging-configuration.xml
But what if we want to invoke it outside Play console? Does this command below work properly?
play start -Dlogger.resource=my-logging-configuration.xml
Well, the answer is NO!
To check, I added some code to my controller to print the logging configuration file in use (which is a logback configuration file), and found that Play does not pick up the VM arguments (i.e. -Dlogger.resource) when invoked with the command above. Instead, Logback uses Play’s default logger.xml in `$PROJECT_HOME/play-2.2.x/repository/local/com.typesafe.play/play_2.10/2.2.x/jars/play_2.10.jar!/logger.xml`
So, to resolve this, we should invoke the “play start” command as below:
play -Dlogger.resource=my-logging-config.xml -Dother.vmarg=val start
That’s all for this month! See you (hopefully earlier than) next month!
It has been 4 months since my last post. I have been busy mainly with my job with LARC (and other extra commitments such as classes at Masjid) and could not allocate time to continue updating this blog :(. But from now, I’ll allocate one of my weekends every month to update this blog. 🙂 *wish me all the best!! 🙂
I’ve just received my new MacBook Pro and upgraded it to OS X 10.9.1. I found that setting up my usual Java development environment is not as straightforward as on Linux (the steps are similar, but not the same!!). Hence, here I present how I set up my Java development environment.
Apple actually already provides its own JDK (it will prompt you to install it), but unfortunately it is still JDK 6. Unless you’re content with JDK 6, you can easily install JDK 7 from the Oracle website and follow the steps in here. At the end of the installation process, you should check the “java” and “javac” versions, and you should also have the Java control panel in “System Preferences”.
And don’t forget to set up the JAVA_HOME environment variable by creating .bash_profile if it doesn’t exist and putting this line in it [reference: article]
export JAVA_HOME=`/usr/libexec/java_home -v 1.7`
Well, it is almost straightforward. Download Eclipse (*.tar.gz) from its website. Extract the tar.gz file into an ‘eclipse’ folder. Then, just copy or move the ‘eclipse’ folder into OS X’s Applications folder. Make sure you have this folder structure in your Applications folder.
Then, you can drag-and-drop the Eclipse launcher into the Dock for easy access. Now, click the Eclipse launcher and see what happens. If Eclipse runs properly, then you’re done!
But.. if Mavericks prompts you to install Apple JRE and you want to use JRE 7, then you need to do this workaround outlined in this discussion. (don’t forget to upvote the answer! :))
Mavericks does not include Maven by default. Hence, the easiest way is to use Homebrew to install it. Follow these instructions if you need to install an older version of Maven (Maven 2 or Maven 3.0.5).
Voila! You have JDK, Eclipse, and Maven installed in your OSX Mavericks! That’s all for now, feel free to give comments and suggestions!
I am currently learning Scala through Twitter’s Scala school, and I found that “Adding Dependencies”, “Packaging and Publishing” and “Adding Tasks” tutorials at Simple Build Tool section are outdated. And in this short tutorial, I’ll share how I perform the outdated tutorials using sbt 0.13.0.
Instead of creating a .scala file in project/build path, we can easily use the build.sbt file to add the required dependencies. Here is the snippet that should be added to build.sbt:
In this post, I’ll give a quick overview of an upcoming distributed streaming machine learning framework, Scalable Advanced Massive Online Analysis (SAMOA). As I mentioned before, SAMOA is part of my and Antonio’s theses with Yahoo! Labs Barcelona.
What is SAMOA?
SAMOA is a tool to perform mining on big data streams. It is a distributed streaming machine learning (ML) framework, i.e. it is like Mahout, but for stream mining. SAMOA contains a programming abstraction for distributed streaming ML algorithms (refer to this post for the stream ML definition) to enable development of new ML algorithms without dealing with the complexity of the underlying stream processing engines (SPEs, such as Twitter Storm and S4). SAMOA also provides extensibility for integrating new SPEs into the framework. These features allow SAMOA users to code a distributed streaming ML algorithm once and execute it on multiple SPEs.
In this post, we will revisit several parallelism types that can be applied to modify conventional streaming (or online) machine learning algorithms into distributed and parallel ones. This post is a quick summary of half of chapter 4 of my thesis (which I completed one month ago! yay!).
Data Parallelism parallelizes and distributes the algorithms based on the data. There are two types of data parallelism: Vertical Parallelism and Horizontal Parallelism.
Horizontal parallelism splits the data based on quantity, i.e. each parallel computation receives a data subset of the same size. For example, if we have 4 components that perform parallel computation and 100 data items, then each component processes 25 items. As shown in the figure below, each parallel component has a local machine learning (ML) model. Every parallel component then periodically pushes updates into the global ML model.
This type of parallelism is often used to provide horizontal scalability. In the online learning context, horizontal parallelism is suitable when the data arrival rate is very high. However, horizontal parallelism needs a large amount of memory, since it needs to replicate the online machine learning model in every parallel computation element. Another caveat of horizontal parallelism is the additional complexity introduced when propagating the model updates between parallel computation elements. An example of horizontal parallelism in a distributed streaming machine learning algorithm is Ben-Haim and Yom-Tov’s work on a streaming parallel decision tree algorithm.
I had the chance to use Twitter Storm for my thesis, and in this post I would like to give some pointers about it. I hope this will be useful for those who are starting to use Storm in their projects 🙂
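The 4-components-and-100-items example above can be sketched as a small single-process simulation. This is only an illustration under stated assumptions: the "model" is a toy running mean, the round-robin split and the `sync_every` interval are my own choices, and each local model only keeps the updates since its last sync instead of a full replica. A real system would run the components on separate machines.

```python
class MeanModel:
    """Toy online model: keeps a running mean of the values it has seen."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, x):
        self.count += 1
        self.total += x

    def merge(self, other):
        # Fold another model's accumulated statistics into this one.
        self.count += other.count
        self.total += other.total

    def mean(self):
        return self.total / self.count if self.count else 0.0


def horizontal_parallel(stream, n_workers=4, sync_every=5):
    """Simulate horizontal parallelism over a stream of numbers.

    Returns the global model and the number of items each worker saw.
    """
    global_model = MeanModel()
    local_models = [MeanModel() for _ in range(n_workers)]
    seen = [0] * n_workers
    for i, x in enumerate(stream):
        w = i % n_workers  # round-robin: equal share per worker
        local_models[w].update(x)
        seen[w] += 1
        if local_models[w].count == sync_every:
            # Periodic update of the global model, then start a fresh delta.
            global_model.merge(local_models[w])
            local_models[w] = MeanModel()
    for m in local_models:
        global_model.merge(m)  # flush whatever has not been synced yet
    return global_model, seen


model, seen = horizontal_parallel(range(100))
print(seen, model.count, model.mean())
```

With 100 items and 4 workers, every worker ends up with 25 items and the global model reflects all 100. The merge step is where the update-propagation complexity mentioned above shows up once the model is something less trivial than a mean.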
Well, I tried to search for Twitter Storm logo, but I could not find it. Then suddenly I remembered about the movie pictured above. Okay, let’s get back to business.
What is Twitter Storm?
Twitter Storm is a distributed streaming computation framework. It does for real-time processing (via streaming) what Hadoop’s MapReduce (MR) does for batch processing. The main reason it exists is the inflexibility of Hadoop MR in handling stream processing, i.e. it’s too complex and error-prone to configure Hadoop MR to handle streaming data (for more detail, watch the first five minutes of this video).
In this post, I plan to write a quick recap of related work in Distributed Streaming Classification, focusing on decision tree induction. It is still related to my thesis on a Distributed Streaming Machine Learning Framework. I divide this post into four sections: Classification, Distributed Classification, Streaming Classification, and Distributed Streaming Classification. Without further ado, let’s start with Classification.
Classification is a type of machine learning task which infers a function from labeled training data. This function is used to predict the label (or class) of testing data. Classification is also called supervised learning, since we use the actual class output (the ground truth) to supervise the output of our classification algorithm. Many classification algorithms have been developed, such as tree-based algorithms (C4.5 decision tree, bagging and boosting decision tree, decision stump, boosted stump, random forest etc.), neural networks, Support Vector Machines (SVMs), rule-based algorithms (conjunctive rule, RIPPER, PART, PRISM etc.), naive Bayes, logistic regression and many more.
This post is a follow-up post about our project, High Availability in YARN. In the previous post, we explained the motivation and our proposed solution to the availability problem in YARN. Now, let’s continue with the implementations and experiments that we have done as a proof of concept for our proposed solution.
As a proof of concept of our proposed architecture, we designed and implemented an NDB storage module for the YARN resource manager. Due to limited time, the recovery failure model was used in our implementation. In this post, we will refer to the proof of concept of NDB-based YARN as YARN-NDB.
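As a tiny illustration of "inferring a function from labeled training data", here is a sketch of a decision stump, one of the tree-based algorithms listed above. It is deliberately simplified: one numeric feature, two classes, and a fixed rule direction (predict 1 when the value is above the threshold). The function name and the toy data are mine, for illustration only.

```python
def train_stump(xs, ys):
    """Learn a one-feature decision stump from labeled data.

    Tries every observed value as a threshold t and keeps the one with
    the fewest training errors for the rule: predict 1 if x > t else 0.
    Returns the chosen threshold and the inferred prediction function.
    """
    best_t, best_err = None, len(xs) + 1
    for t in sorted(set(xs)):
        # Count how many training examples this threshold misclassifies.
        err = sum((1 if x > t else 0) != y for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err

    def predict(x):
        return 1 if x > best_t else 0

    return best_t, predict


# Labeled training data: feature value and ground-truth class.
train_x = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
train_y = [0, 0, 0, 1, 1, 1]

threshold, predict = train_stump(train_x, train_y)
print(threshold, predict(2.5), predict(6.5))
```

The ground-truth labels in `train_y` are what "supervise" the choice of threshold, and `predict` is the inferred function that we then apply to unseen testing data.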