Flume Event Scalability Analysis

This is the follow-up to the project in this post. Our professor asked us to perform a scalability analysis of the technology we used and to write a short report on it. I decided to analyze Flume’s scalability in terms of the number of events that a given Flume configuration can support. The project itself is inspired by Mike Percy’s Flume-NG performance measurement, and I re-used some of his software components. There are two main differences between this project and Mike’s work:

  • This experiment introduces a one-to-one relationship between nodes and Flume load generators. That means each Flume load generator process runs on its own node (an Amazon EC2 medium instance).
  • This experiment introduces a cascading setup, to verify whether it improves scalability compared to the non-cascading setup (a sketch of the cascading hop follows this list).
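
To make the cascading setup concrete, here is a minimal sketch of the intermediate hop (hostnames, ports and channel sizing below are illustrative placeholders, not our measured configuration). Each tier-1 agent receives events from its load generator over Avro and relays them to the tier-2 collector:

# Tier-1 agent: receives events from a load generator node and
# relays them downstream to the tier-2 collector (the cascading hop).
tier1.sources = in
tier1.channels = mem
tier1.sinks = out

# Avro source: the load generator sends events to this port.
tier1.sources.in.type = avro
tier1.sources.in.bind = 0.0.0.0
tier1.sources.in.port = 4141
tier1.sources.in.channels = mem

# In-memory buffer between source and sink.
tier1.channels.mem.type = memory
tier1.channels.mem.capacity = 100000

# Avro sink: forwards events to the next tier.
tier1.sinks.out.type = avro
tier1.sinks.out.hostname = tier2-collector.example.com
tier1.sinks.out.port = 4141
tier1.sinks.out.channel = mem

In the non-cascading baseline, the load generators point straight at the final collector; the experiment compares the sustained event rate of the two layouts.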

Flume-based Independent News Aggregator

It has been more than two weeks since my last post! 🙁 I was busy with exams, project reports and my trip preparation. Finally, I managed to find time (on my journey from Warsaw to Prague) to update my blog :p

Well, I would like to cover our SDS project titled “Flume-based Independent News Aggregator”, but my project-mate, Mario, has already covered it in his blog. So I’ll just give you the link to Mario’s post, which is here.

As a follow-up to the project, our professor asked us to experiment with the system’s scalability. It is an individual project, and I plan to experiment with Flume’s scalability in terms of the number of events it can support. Mario will do something related to its reliability and fault tolerance. I plan to update this blog once I finish the project 🙂

*Update: The post about the Flume scalability mini project can be found here.

Weekend with Flume – Part 2

After covering some of the basic configurations of Flume in Part 1 of the Weekend with Flume series, I’ll cover the Avro Source, Avro Sink and HDFS Sink in this post. Let’s pick a scenario from our school project, shown below. We set up this configuration on Amazon Web Services (AWS), but I will not discuss our experiences with AWS in this post.

Flume-based News Aggregator

Avro Sink Configuration

We have Agent#1 connecting to the Collector through a pair of Avro Sink and Avro Source, wired up through the Flume configuration file on Agent#1.
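
As a minimal sketch of what such a file looks like (hostnames, ports, paths and channel settings here are illustrative placeholders, not necessarily the values we used), Agent#1 needs a channel and an Avro Sink pointing at the Collector:

# Agent#1: ships events to the Collector over Avro.
agent1.sources = src
agent1.channels = ch
agent1.sinks = avroSink

# Example source: tail a log file (hypothetical path).
agent1.sources.src.type = exec
agent1.sources.src.command = tail -F /var/log/news/feed.log
agent1.sources.src.channels = ch

agent1.channels.ch.type = memory
agent1.channels.ch.capacity = 10000

# Avro Sink: connects to the Collector's Avro Source.
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.hostname = collector.example.com
agent1.sinks.avroSink.port = 4545
agent1.sinks.avroSink.channel = ch

On the Collector side, a matching Avro Source receives the events and an HDFS Sink writes them out:

# Collector: Avro Source in, HDFS Sink out.
collector.sources = avroSrc
collector.channels = ch
collector.sinks = hdfsSink

collector.sources.avroSrc.type = avro
collector.sources.avroSrc.bind = 0.0.0.0
collector.sources.avroSrc.port = 4545
collector.sources.avroSrc.channels = ch

collector.channels.ch.type = memory

# HDFS Sink: writes events as plain files under the given path.
collector.sinks.hdfsSink.type = hdfs
collector.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/flume/news
collector.sinks.hdfsSink.hdfs.fileType = DataStream
collector.sinks.hdfsSink.channel = ch

The key detail is that the Avro Sink’s hostname/port on Agent#1 must match the Avro Source’s bind/port on the Collector, and each agent needs its own channel.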

Weekend with Flume – Part 1

I was in geeky mode this weekend, spending most of my time configuring Flume for our SDS project. I’ll share some observations and tricks from our group’s Flume configuration work.

Apache Flume

Flume Primer

Quoting the definition from Apache’s Flume Wiki:

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
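
To make the primer concrete, here is the canonical single-agent starter configuration from the Flume user guide (a netcat source feeding a logger sink through a memory channel; the names are the standard placeholders, not from our project):

# Minimal single-agent Flume NG configuration:
# netcat source -> memory channel -> logger sink.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Listen for plain-text events on localhost:44444.
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Buffer events in memory between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Log each event (handy for smoke-testing a setup).
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

Started with flume-ng agent --conf conf --conf-file example.conf --name a1, this gives a quick way to confirm that events flow end-to-end before wiring in real sources and sinks.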

Apache Flume

This time, our group needed to prepare a presentation about Apache Flume for EEDC homework. Flume is intended to solve the challenges of safely transferring huge sets of data from nodes (for example: log files on a company’s web servers) to a data store (for example: HDFS, Hive, HBase, Cassandra, etc.).

Apache Flume

Well, for a simple system with a relatively small data set, we usually build a custom solution for this job, such as a script that transfers the logs to a database. However, this kind of ad-hoc solution is difficult to scale because it is usually tailored very closely to one system. It can also suffer from manageability problems, especially when the original programmer or engineer who created it leaves the company. It is often difficult to extend and, furthermore, it may have reliability problems due to bugs in the implementation.

And Apache Flume comes to the rescue!!!