Dremel – Paper Review

Time is really ticking, and somehow this semester I have not been able to post as often as last semester.. Well, let’s start posting again.. hehe

I did a paper review on Dremel (or here for the ACM version) as part of ID2220 (Advanced Topics in Distributed Systems assignment), and here is the summary of my review. I also attached a very nice set of slides on Dremel made by my classmate, Maria, at the end of this post.

What is Dremel?

Dremel is a data analytics platform that allows interactive, ad-hoc exploration of web-scale data sets. Continue reading Dremel – Paper Review

last.fm Crawler

Two weeks ago, Mario, Zafar, and I did a mini project to crawl last.fm’s social graph. We performed a random walk over the social graph and collected user data such as age, play counts, and number of playlists. Using the collected data, we estimated properties of last.fm users with a simple average and with an average normalized by the number of friends each user has (node degree). The details of the project can be found in Mario’s post, and I attached the project slides for easy reference:
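The difference between the two estimates can be sketched in a few lines of Python. This is just an illustration of the idea, not our actual crawler code, and the field names (`age`, `degree`) are assumptions. A plain random walk over-samples well-connected users, so the normalized version re-weights each sample by 1/degree:

```python
def simple_average(samples):
    """Plain mean over random-walk samples (biased toward high-degree users)."""
    return sum(s["age"] for s in samples) / len(samples)

def degree_normalized_average(samples):
    """Re-weight each sample by 1/degree to correct the random walk's
    bias toward well-connected users."""
    numerator = sum(s["age"] / s["degree"] for s in samples)
    denominator = sum(1 / s["degree"] for s in samples)
    return numerator / denominator

# Example: a high-degree user is more likely to be visited, so the
# normalized estimate shifts weight back toward low-degree users.
samples = [{"age": 20, "degree": 1}, {"age": 40, "degree": 4}]
print(simple_average(samples))             # 30.0
print(degree_normalized_average(samples))  # 24.0
```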



Flume Event Scalability Analysis

This is the follow-up to the project in this post. Our professor asked us to perform a scalability analysis of the technology we used and write a short report on it. I decided to analyze Flume scalability in terms of the number of events that a given Flume configuration can support. The project itself is inspired by Mike Percy’s Flume-NG performance measurement, and I reused some of his software components. There are two main differences between this project and Mike’s work:

  • This experiment introduces a one-to-one relationship between nodes and Flume load generators. That means each Flume load generator process runs on an independent node (an Amazon EC2 medium instance).
  • This experiment introduces a cascading setup, which verifies whether there is an improvement in scalability compared to a non-cascading setup.
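To make the cascading setup concrete, here is a minimal sketch of what an intermediate (“cascade”) tier agent could look like in Flume NG’s properties format. The agent name, host, and port below are assumptions for illustration, not our actual experiment configuration; the key point is that the tier receives events over an Avro Source and forwards them upstream again over an Avro Sink:

```properties
# Hypothetical cascade-tier agent: Avro in, Avro out
cascade.sources = avroIn
cascade.channels = memCh
cascade.sinks = avroOut

cascade.sources.avroIn.type = avro
cascade.sources.avroIn.bind = 0.0.0.0
cascade.sources.avroIn.port = 4545
cascade.sources.avroIn.channels = memCh

cascade.channels.memCh.type = memory

cascade.sinks.avroOut.type = avro
cascade.sinks.avroOut.hostname = collector.internal
cascade.sinks.avroOut.port = 4545
cascade.sinks.avroOut.channel = memCh
```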

Flume-based Independent News Aggregator

It has been more than two weeks since my last post! 🙁 I was busy with exams, project reports, and my trip preparation. Finally, I managed to find time (on my journey from Warsaw to Prague) to update my blog :p

Well, I would like to cover our SDS project, titled “Flume-based Independent News Aggregator”, but my project-mate, Mario, has already covered it in his blog. So in this case, I’ll just give you the link to Mario’s post, which is here.

As a follow-up to the project, our professor asked us to experiment with the system with regard to its scalability. It is an individual project, and I plan to experiment with Flume scalability in terms of the number of events it can support. Mario will do something related to its reliability and fault tolerance. I plan to update this blog once I finish the project 🙂

*Update: The post about the Flume scalability mini project can be found here.

Weekend with Flume – Part 2

After covering some of the basic configurations of Flume in Part 1 of the Weekend with Flume series, I’ll cover the Avro Source, Avro Sink, and HDFS Sink in this post. Let’s pick a scenario from our school project, shown below. We set up this configuration on Amazon Web Services (AWS), but I will not discuss our experiences with AWS in this post.

Flume-based News Aggregator

Avro Sink Configuration

We have Agent#1 connecting to the Collector through a pair of Avro Sink and Avro Source. To achieve this configuration, we use the following Flume configuration file in Agent#1. Continue reading Weekend with Flume – Part 2
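For readers who don’t click through: a minimal sketch of what such an Agent#1 configuration might look like in Flume NG’s properties format. The source command, collector hostname, and port are assumptions for illustration, not the actual project config; the essential part is the Avro Sink pointing at the Collector’s Avro Source:

```properties
# Agent#1: collect events locally and forward them to the Collector over Avro
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = avroSink1

agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/news/feed.log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Avro Sink paired with the Collector's Avro Source
agent1.sinks.avroSink1.type = avro
agent1.sinks.avroSink1.hostname = collector.example.com
agent1.sinks.avroSink1.port = 4545
agent1.sinks.avroSink1.channel = ch1
```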

Weekend with Flume – Part 1

I was in geeky mode this weekend, spending most of my time configuring Flume for our SDS project. I’ll share some observations and tricks from our group’s Flume configuration.

Apache Flume

Flume Primer

Quoting the definition from Apache’s Flume wiki:

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Continue reading Weekend with Flume – Part 1

Rise of Network Virtualization – Final

This is the follow-up to this post about the EEDC project this semester. Well, I would have liked more time to polish my slides, but that’s fine. It’s already past 🙂

I used a historical perspective to discuss the “rise” of network virtualization; in this context I focused on SDN (Software-Defined Networking), and it all started in 2007. At that time, networks were getting faster, but not better. They had several limitations, such as high complexity of maintenance, a high likelihood of inconsistent policies across devices in the network, inability to scale, and vendor dependency. On the other hand, the need for a new network architecture was crucial due to changing traffic patterns (not only from client to server or vice versa, but also between nodes in a server cluster), the consumerization of IT, and the rise of cloud services and Big Data. Continue reading Rise of Network Virtualization – Final

Large-Scale Decentralized Storage Systems for Volunteer Computing Systems – Final

This is my first post on the new domain :). I would like to present the final report and presentation of the Decentralized Systems (DS) project, whose overview I posted here.

We continued with the survey of Decentralized Storage Systems (DSS), and we ranked them based on the following five characteristics:

  1. Availability (AV). Formally, the fraction of time the system is up and performing normally over uptime + downtime. In the DSS context, it is usually reflected in the system’s resistance to churn, the level of fault tolerance implemented by the system, and how easily and quickly we can get the file or data we want.
  2. Scalability (SC). The ability of the system to be enlarged to accommodate growth. In the context of DSS, this could be in terms of the number of nodes, the number of messages exchanged, or the amount of data stored in the storage system.
  3. Eventual Consistency (ECO). ECO is preferred over strict Consistency (C) because Availability is valued more, and pure Consistency is impractical and expensive to achieve in terms of the messages required between nodes.
  4. Performance (P). An example metric that can be used to measure performance is the response time to satisfy a search request. We analyzed whether or not each system puts emphasis on it.
  5. Security (SE). Resistance to some degree of attacks or breaches, such as compromised data integrity or unauthorized data access by a malicious node. Due to the nature of volunteer computing, security is an important feature for attracting volunteers and ensuring data validity.

We also identified the focus of each DSS we surveyed, as shown below. Continue reading Large-Scale Decentralized Storage Systems for Volunteer Computing Systems – Final

Intelligent Placement of Data Center for Internet Services

It has been a week since my last post :). Well, I was pretty occupied: dealing with deadlines, plus an impromptu soccer + Paris trip :p. Back to business now. Here is my latest slide deck for the EEDC assignment. It discusses how to determine data center locations, based on the paper Intelligent Placement of Datacenters for Internet Services by I. Goiri et al. It is pretty interesting, because this kind of information is usually confidential for the internet “juggernauts” (Google, Facebook, etc.).

The authors propose a framework to address data center placement, then formulate the framework as an optimization problem. Continue reading Intelligent Placement of Data Center for Internet Services

Rise of Network Virtualization

I’m planning to cover the Network Virtualization topic for my next project in the EEDC course. The main inspiration comes from this Technology Review article about Nicira. And here is my plan for the task, as presented to our professor.

Network Virtualization is the next trend in virtualization after OS virtualization. It allows users to easily reconfigure a network in a cloud computing environment, as well as increase the network’s security. Continue reading Rise of Network Virtualization