Bootstrapping Twitter Storm

I have chances to use Twitter Storm for my thesis and in this post I would like  to give some pointers about it. I hope this will be useful for those who are starting to use Storm in their project 🙂

Of course I am NOT talking about this movie :D
Of course I am NOT talking about this movie 😀

Well, I tried to search for Twitter Storm logo, but I could not find it. Then suddenly I remembered about the movie pictured above. Okay, let’s get back to business.

What is Twitter Storm?

Twitter Storm is a distributed streaming computation framework. It does, for real-time-processing(via streaming), what Hadoop’s MapReduce (MR) does for batch processing. The main reason why it exists is in inflexibility of Hadoop MR in handling stream processing, i.e. it’s too complex and error-prone to configure Hadoop MR in handling streaming data (for more detail, watch the first five minutes of this video).

Fundamental Concept

I do no want to re-invent the wheel, so please explore directly to Concepts section in Storm documentation.  I recommend you to go to Parallelism section below in order to gain more understanding of  tasks and workers. One particular aspect that I want to highlight is: Storm using pull model, i.e. Spout’s nextTuple method is invoked by Storm engine to pull the data to a destination bolt. And it is the same case with Bolt’s execute method.

Sample Code

For those who prefer to see some codes in using Storm, go to storm-starter project and clone them to your local machine. Some important storm-starter classes to speed-up your learning are

  1. ExclamationTopology. This is the most basic example of Storm. The topology consists of 1 spout and 2 bolts: TestWordSpout -> ExclamationBolt -> ExclamationBolt. And both connections are using shuffle-grouping. Refer to line 48 to line 52 for the details in creating and connecting the components. Take a look also for the implementation of ExclamationBolt.
  2. RandomSentenceSpout. An example of spout implementation, note that the data source is internal to the Spout, i.e. the data source is just an array of sentences initialized inside the spout.

Once you’re familiar and comfortable with basic spout, bolt and topology implementation, I suggest you to start creating your own Storm components and execute your topology. A small note for storm-starter project: if you want to use Maven, rename M2-pom.xml to pom.xml and import the Maven project to you favourite editor.

Another alternative is to check other toy projects/applications in Storm such as storm-word-count. This example contains spout that uses Twitter4J to consume Twitter stream. Storm’s tick feature (available in 0.8+) is also used in one of the bolts to provide periodical action/behavior.

Parallelism

Ones does not simply say that Storm parallelism is only around configuring the number of bolt or spout in the topology. There are much more. In order to be able to execute your Storm topology effectively, you need to understand the concept of parralelism hint, executor, worker, task. You need to be able to answer these following questions: What are they? Which one is the process? which one is the thread? what components that spouts and bolts correspond that?

But no worry, Michael Noll discussed about Storm parallelism concept in his awesome post, which eventually transferred into Storm documentation. Check them out!

Storm Development Environment

Now we are going to discuss about development environment. In general, Storm has two modes: local and cluster. In local mode, you need to use LocalCluster class and you don’t need to setup anything in your development machine. The topology will be executed in a Java process in your machine. This mode is useful to perform initial testing of your topology.

In cluster mode, here are the steps (assuming you already have a Storm cluster running):

  1. Download a storm release, unpack it, and put the bin folder on your path directory.
  2. Put cluster information in ~/.storm/storm.yaml. At the minimum you need to setup “nimbus.host” configuration of the .yaml file.
  3. Make sure you use StormSubmitter class in your code, as shown in this example.

Wait, in step 2, I mentioned about “nimbus”. Well, what is it actually? Quoting from Storm wiki page (since I don’t want to re-invent the wheel and keep the brevity of this post ;))

A Storm cluster is managed by a master node called “Nimbus”. Your machine communicates with Nimbus to submit code (packaged as a jar) and topologies for execution on the cluster, and Nimbus will take care of distributing that code around the cluster and assigning workers to run your topology. Your machine uses a command line client called “storm” to communicate with Nimbus. The “storm” client is only used for remote mode; it is not used for developing and testing topologies in local mode.

At this stage, it’s good to know also about the topology lifecycle as explained in this Storm wiki page.

Setting up Storm cluster

Use Michael Noll’s Wirbelstrum, or follow his tutorial here.

Without further ado, follow this page from Storm wiki. During installation, I found that these following steps are time consuming because the instructions in the wiki does not cover a case where you do not have administration access to your cluster.

  1. Setup ZeroMQ. Remember to use ZeroMQ version 2.1.7 which you download from here. One important tips here is to call “./configure” with prefix options (“./configure –prefix=”$PREFIX_DIR”, set $PREFIX_DIR with the desired installation path). Prefix options specify where ZeroMQ put the library files from compilation, which means you do not need to have administrator privilege to setup Storm cluster.
  2. Setup JZMQ. For this step, you need to download and build JZMQ from here because Storm only tested to work with this specific version of JZMQ. Again, use “–prefix” option and “–with-zeromq” option in your ./configure script. “–prefix” works the same way as zeromq setup. “–with-zeromq” flag specifies the path where zeromq is installed in your system.

A small note on Zookeeper: Storm uses library from ZooKeeper 3.3.3, so it is recommended to use ZooKeeper 3.3.3 to minimize the surprise due to different ZooKeeper version. Refer to this thread for more info.

Configuration

Once you have a Storm cluster and your Hello-World program running, it’s time to do some tweaking. Explore the wiki page about Configuration and check the default values for storm.yaml file so you know that are the available options.

Wrap up!

I think that’s enough for now! The deadline for thesis is approaching, therefore for the following posts, I think I’ll write some part of thesis, especially SAMOA design and  the parallel classifier that implemented on top of SAMOA. Ciaoooo!!

 

One thought on “Bootstrapping Twitter Storm”

Leave a Reply