Java 8 at a glance

I’ve just attended a talk about Java 8 by Chuk Lee at the NUS Hackers meetup, and here’s the summary. If you’re impatient and want to try the code, go directly to the java-8-at-a-glance repository. :)

Chuk started the presentation with a short history of the JDK. Some points to note:

  • Oracle acquired Sun in 2010, while JDK 7 was still under development
  • At that time JDK 7 was a very ambitious project, trying to make the JDK more modular (i.e. letting us choose which components of the JDK to ship with our app), add lambda expressions, and much more.
  • Oracle asked the Java community to vote on whether to release JDK 7 ASAP with the intended features incomplete, or to wait longer for a more complete release.
  • The Java community chose the first option, and JDK 7 was released in 2011 (five years after JDK 6) with decent features[1] such as NIO2, type inference in generics, and the fork/join framework.
  • However, the promised modularity and lambda expressions did not make it into JDK 7.
  • And here they are… in Java 8!

So, what are the highlights in Java 8?

  • From the language point of view: lambda expressions and interface evolution
  • From the libraries point of view: streams and bulk data operations on collections (a quick taste of lambdas and streams follows this list)
  • From the platform point of view: the profiles option, a step towards modularity
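
Here’s that quick taste, a minimal sketch of my own (not code from the talk):

import java.util.Arrays;
import java.util.List;

public class Java8Taste {
    public static void main(String[] args) {
        List<String> names = Arrays.asList("Chuk", "James", "Duke");

        // Lambda expression: a one-line implementation of Runnable.
        Runnable greeter = () -> System.out.println("Hello from a lambda!");
        greeter.run();

        // Stream: bulk data operations on a collection.
        names.stream()
             .filter(name -> name.length() > 4)   // keep names longer than 4 characters
             .map(String::toUpperCase)            // method reference
             .forEach(System.out::println);       // prints JAMES
    }
}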

Nope, we are not talking about Java Island
Continue reading

Configuring Logging in Play 2.2.x

Well, actually it is pretty straightforward: just follow the Play 2.2.x documentation. But there is a caveat that cost me several hours to resolve.

Play Framework

Based on the Play 2.2.x documentation, we should pass `-Dlogger.resource` after the `start` command inside the Play console, i.e.

[OS-console] $ play
[info] Loading project definition from ....
[info] ...
..... //initialization message from Play
[play-console] $ start -Dlogger.resource=my-logging-configuration.xml

But what if we want to invoke it outside the Play console? Does the command below work properly?

play start -Dlogger.resource=my-logging-configuration.xml

Well, the answer is NO!

So, I added some code to my controller to print the path of the logging configuration file (which is a Logback configuration file), and it turns out Play does not pick up the VM arguments (i.e. `-Dlogger.resource`) with the command above. Logback falls back to Play’s default logger.xml in `$PROJECT_HOME/play-2.2.x/repository/local/com.typesafe.play/play_2.10/2.2.x/jars/play_2.10.jar!/logger.xml`.
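
For reference, here is a minimal way to check which configuration Logback actually loaded (a sketch, not necessarily the exact code I used). Drop these lines into any controller action: Logback’s internal status output includes the URL of the configuration file it found.

import ch.qos.logback.classic.LoggerContext;
import ch.qos.logback.core.util.StatusPrinter;
import org.slf4j.LoggerFactory;

// Dump Logback's internal status; one of the first lines names the
// configuration file (e.g. logger.xml) that Logback is actually using.
LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();
StatusPrinter.print(context);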

So, to resolve this, we should pass the VM arguments before the `start` command, as below:

play -Dlogger.resource=my-logging-configuration.xml -Dother.vmarg=val start

That’s all for this month! See you (hopefully earlier than) next month!

Java, Eclipse, & Maven on Mac OS X Mavericks

Hello everyone!!!

It has been 4 months since my last post. I have been busy, mainly with my job at LARC (and other commitments such as classes at the masjid), and could not allocate time to keep updating this blog :(. But from now on, I’ll set aside one weekend every month to update this blog. :) *wish me all the best!! :)

I’ve just received my new MacBook Pro and upgraded it to OS X 10.9.1. I found that setting up my usual Java development environment is not as straightforward as on Linux (the steps are similar, but not the same!!). Hence, here is how I set up my Java development environment.

JDK

Apple actually provides its own JDK (it will prompt you to install it), but unfortunately it is still JDK 6. Unless you’re content with JDK 6, you can easily install JDK 7 from the Oracle website by following the steps here. At the end of the installation process, you should check the “java” and “javac” versions, and you should also have a Java control panel in “System Preferences”.

java and javac version

System Preferences with Java control panel

And don’t forget to set up the JAVA_HOME environment variable: create .bash_profile if it doesn’t exist and put this line in it [reference: article]

export JAVA_HOME=`/usr/libexec/java_home -v 1.7`

Eclipse

Well, it is almost straightforward. Download Eclipse (*.tar.gz) from its website and extract the tar.gz file into an ‘eclipse’ folder. Then, just copy or move the ‘eclipse’ folder into OS X’s Applications folder. Make sure you have this folder structure in your Applications folder.

Eclipse folder structure in Applications folder

Then, you can drag-and-drop the Eclipse launcher into the Dock for easy access. Now, click the Eclipse launcher and see what happens. If Eclipse runs properly, then it’s done!

But… if Mavericks prompts you to install the Apple JRE and you want to use JRE 7, then you need the workaround outlined in this discussion. (don’t forget to upvote the answer! :))

Maven

Mavericks does not include Maven by default, so the easiest way is to use Homebrew to install it. Follow these instructions if you need to install an older version of Maven (Maven 2 or Maven 3.0.5).
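
Assuming you already have Homebrew set up, the latest Maven is one command away:

brew install maven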

Voila! You have JDK, Eclipse, and Maven installed on OS X Mavericks! That’s all for now; feel free to leave comments and suggestions!

Outdated sbt Tutorials at Twitter’s Scala School

I am currently learning Scala through Twitter’s Scala School, and I found that the “Adding Dependencies”, “Packaging and Publishing” and “Adding Tasks” tutorials in the Simple Build Tool section are outdated. In this short tutorial, I’ll share how I perform those same steps using sbt 0.13.0.

The Scala logo

Adding Dependencies

Instead of creating a .scala file in the project/build path, we can simply use the build.sbt file to add the required dependencies. Here is the snippet to add to build.sbt:

libraryDependencies ++= Seq(
  "org.scala-tools.testing" % "specs_2.10" % "1.6.+" % "test",
  "org.codehaus.jackson" % "jackson-core-asl" % "1.9.+"
)

Continue reading

SAMOA – Scalable Advanced Massive Online Analysis

In this post, I’ll give a quick overview of an upcoming distributed streaming machine learning framework, Scalable Advanced Massive Online Analysis (SAMOA). As I mentioned before, SAMOA is part of my and Antonio’s theses with Yahoo! Labs Barcelona.

What is SAMOA?

SAMOA is a tool for mining big data streams. It is a distributed streaming machine learning (ML) framework; think of it as a Mahout for stream mining. SAMOA contains a programming abstraction for distributed streaming ML algorithms (refer to this post for a definition of streaming ML) that enables the development of new ML algorithms without dealing with the complexity of the underlying stream processing engines (SPEs, such as Twitter Storm and S4). SAMOA is also extensible, so new SPEs can be integrated into the framework. Together, these features allow SAMOA users to write distributed streaming ML algorithms once and execute them on multiple SPEs.
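
To illustrate the “write once, run on any SPE” idea, here is a purely illustrative Java sketch; the interfaces below are made up for this post and are NOT SAMOA’s actual API.

// Made-up interfaces illustrating an SPE-agnostic abstraction layer.
interface ContentEvent {
    Object getData();               // the payload travelling through the topology
}

interface Stream {
    void put(ContentEvent event);   // send an event downstream
}

interface Processor {
    void process(ContentEvent event);   // the ML logic lives here
}

// An algorithm is implemented once against the abstraction...
class CountingProcessor implements Processor {
    private final Stream output;
    private long count = 0;

    CountingProcessor(Stream output) {
        this.output = output;
    }

    @Override
    public void process(ContentEvent event) {
        count++;               // update some state (here: a simple counter)
        output.put(event);     // pass the event downstream unchanged
    }
}

Each SPE then gets its own adapter: for Storm, a Stream implementation could wrap a bolt’s output collector; for S4, the corresponding S4 stream object. The algorithm code itself never changes.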

Continue reading

Eid Mubarak!

After a month of fasting in the holy month of Ramadan, Muslims around the world (including me! :)) celebrate Eid al-Fitr, which marks the end of fasting. Hence, I would like to wish all Muslims:

Eid Mubarak!

تَقَبَّلَ اللَّهُ مِنَّا وَمِنْك

May Alloh accept it (all the good deeds) from us and from you[1].

Keep the level of ibadah (observance) high after Ramadan! *a reminder for myself as well

And I hope we’ll meet Ramadan again next year! *ameen :)

TaqabbalAllohu minna wa minkum :)

[1] Ucapan Selamat Hari Raya Idul Fitri, from muslim.or.id (In Bahasa Indonesia)

Parallelism for Distributed Streaming ML

In this post, we will revisit several parallelism types that can be applied to turn conventional streaming (or online) machine learning algorithms into distributed and parallel ones. This post is a quick summary of half of chapter 4 of my thesis (which I completed one month ago! yay!).

Data Parallelism

Data Parallelism parallelizes and distributes the algorithm based on the data. There are two types of data parallelism: Vertical Parallelism and Horizontal Parallelism.

Horizontal Parallelism

Horizontal parallelism splits the data by quantity, i.e. equal-sized subsets of the data go into each parallel computation. If, say, we have 4 components performing parallel computation and 100 data items, then each component processes 25 items. As shown in the figure below, each parallel component has a local machine learning (ML) model, and each component periodically pushes its updates into the global ML model.

Horizontal Parallelism

This type of parallelism is often used to provide horizontal scalability. In the online learning context, horizontal parallelism is suitable when the data arrival rate is very high. However, it needs a large amount of memory, since the online ML model is replicated in every parallel computation element. Another caveat is the additional complexity introduced when propagating model updates between the parallel computation elements. An example of horizontal parallelism in a distributed streaming ML algorithm is Ben-Haim and Yom-Tov’s work on a streaming parallel decision tree algorithm.
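
To make the pattern concrete, here is a minimal, purely illustrative Java sketch (the Model and Instance types are hypothetical placeholders, not from any real framework): several workers each keep a local model and periodically fold their updates into a shared global model.

import java.util.concurrent.BlockingQueue;

// Hypothetical types, for illustration only.
interface Instance {}

interface Model {
    void update(Instance instance);   // learn from one data item
    void merge(Model localModel);     // fold a local model into this one
    Model copy();                     // snapshot used to seed a worker
}

class HorizontalWorker implements Runnable {
    private final Model localModel;
    private final Model globalModel;
    private final BlockingQueue<Instance> input;  // this worker's share of the stream
    private final int updatePeriod;               // merge into the global model every N items

    HorizontalWorker(Model globalModel, BlockingQueue<Instance> input, int updatePeriod) {
        this.globalModel = globalModel;
        this.localModel = globalModel.copy();
        this.input = input;
        this.updatePeriod = updatePeriod;
    }

    @Override
    public void run() {
        int seen = 0;
        try {
            while (!Thread.currentThread().isInterrupted()) {
                localModel.update(input.take());    // learn locally
                if (++seen % updatePeriod == 0) {
                    synchronized (globalModel) {    // periodic global update
                        globalModel.merge(localModel);
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

The memory caveat above is visible in the sketch: every worker holds its own copy of the model.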

Continue reading

Bootstrapping Twitter Storm

I have the chance to use Twitter Storm for my thesis, and in this post I would like to give some pointers about it. I hope this will be useful for those who are starting to use Storm in their projects :)

Of course I am NOT talking about this movie :D

Well, I tried to search for a Twitter Storm logo, but I could not find one. Then I suddenly remembered the movie pictured above. Okay, let’s get back to business.

What is Twitter Storm?

Twitter Storm is a distributed streaming computation framework. It does for real-time processing (via streaming) what Hadoop’s MapReduce (MR) does for batch processing. The main reason it exists is the inflexibility of Hadoop MR for stream processing: it is too complex and error-prone to coerce Hadoop MR into handling streaming data (for more detail, watch the first five minutes of this video).
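
As a taste of the API, here is a minimal topology sketch against the classic backtype.storm API (the Storm 0.8/0.9 era); the spout and bolt are toy placeholders I made up for illustration.

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ToyTopology {
    // A toy spout that emits an endless stream of sentences.
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            collector.emit(new Values("the quick brown fox"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // A toy bolt that prints every sentence it receives.
    public static class PrinterBolt extends BaseRichBolt {
        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) { }

        @Override
        public void execute(Tuple tuple) {
            System.out.println(tuple.getStringByField("sentence"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        builder.setBolt("printer", new PrinterBolt(), 2).shuffleGrouping("sentences");

        // Run in-process for development; a real deployment would use StormSubmitter.
        new LocalCluster().submitTopology("toy", new Config(), builder.createTopology());
    }
}

Continue reading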

Distributed Streaming Classification: Related Work

In this post, I plan to write a quick recap of related work in Distributed Streaming Classification, focusing on decision tree induction. It is still related to my thesis on a Distributed Streaming Machine Learning Framework. I divide this post into four sections: Classification, Distributed Classification, Streaming Classification, and Distributed Streaming Classification. Without further ado, let’s start with Classification.

Classification

Classification is a type of machine learning task which infers a function from labeled training data. This function is then used to predict the label (or class) of unseen testing data. Classification is also called supervised learning, since we use the actual class output (the ground truth) to supervise the output of our classification algorithm. Many classification algorithms have been developed, such as tree-based algorithms (the C4.5 decision tree, bagged and boosted decision trees, decision stumps, boosted stumps, random forests, etc.), neural networks, Support Vector Machines (SVMs), rule-based algorithms (conjunctive rules, RIPPER, PART, PRISM, etc.), naive Bayes, logistic regression, and many more.

Continue reading

Hoeffding Tree for Streaming Classification

In the previous post, we summarized C4.5 decision tree induction. Since my thesis is about distributed streaming machine learning, it’s time to talk about streaming decision tree induction, and I think it’s better to start by defining “streaming machine learning” in general.

Streaming Machine Learning

Streaming machine learning can be interpreted as performing machine learning in the streaming setting. In this case, the streaming setting is characterized by:

  • High data volume and rate, such as transaction logs from ATM and credit card operations, call logs in telecommunication companies, and social media data, i.e. the Twitter tweet stream or the Facebook status update stream
  • Unbounded data, which means the data keep arriving in our system and we won’t be able to fit them all in memory or on disk for later analysis. This characteristic implies we get to analyse each data item roughly once, with little chance to revisit it
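
Since this post is about the Hoeffding tree, it is worth quoting (for reference) the standard bound that gives the tree its name: after $n$ independent observations of a real-valued random variable $r$ with range $R$, with probability $1 - \delta$ the true mean of $r$ differs from the sample mean by at most

$$\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}}$$

Streaming decision tree induction uses this bound to decide, after seeing only $n$ examples at a leaf, whether the currently best split attribute is truly the best one with high confidence.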

Continue reading