The Curious Case of Consistency – Part 1

Last week I attended an interesting breakfast talk by Doug Terry, principal researcher at Microsoft Research Silicon Valley, and I think it was a pretty cool talk B-). The talk was about the consistency types that lie between Strong and Eventual Consistency, and how these additional types are used in practice, illustrated through simple pseudo-code for a baseball game. This is the first of two planned posts on this topic.

Let’s start with the two most basic consistency models (which we usually learn in our first Distributed Systems course/module at university): Strong Consistency and Eventual Consistency.

In Strong Consistency, every node in a distributed system always sees the same, latest view of the data after any node updates it. This consistency type sacrifices performance and availability to ensure a consistent view across the nodes, since maintaining consistency is expensive in terms of network and computing resources. An example of a system that uses Strong Consistency is Windows Azure. Strong Consistency is desirable within a data center; however, performance and availability concerns start to arise in geo-replicated services that span multiple data centers.
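
To make the contrast concrete, here is a minimal toy sketch in Python. It is not how Windows Azure or any real system replicates data; the `ToyStore` class and its method names are made up for illustration. A "strong" write updates every replica before returning, while an "eventual" write updates one replica and propagates later.

```python
class ToyStore:
    """Toy in-memory model of replicas (hypothetical, for illustration only)."""

    def __init__(self, n_replicas=3):
        # Each replica is just a dict holding its own copy of the data.
        self.replicas = [{} for _ in range(n_replicas)]
        # Updates accepted by replica 0 but not yet sent to the others.
        self.pending = []

    def write_strong(self, key, value):
        # Strong: the write returns only after every replica has the value.
        for replica in self.replicas:
            replica[key] = value

    def write_eventual(self, key, value):
        # Eventual: only replica 0 is updated now; the rest catch up on sync().
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def sync(self):
        # Propagate queued updates, in order, to the remaining replicas.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()

    def read(self, key, replica_index):
        return self.replicas[replica_index].get(key)


store = ToyStore()
store.write_strong("score", 1)    # every replica now returns 1
store.write_eventual("score", 2)  # replica 0 returns 2, the others still 1
stale = store.read("score", 2)    # stale read from a lagging replica
store.sync()                      # after propagation all replicas return 2
```

The strong write pays for its guarantee by touching every replica on each update, which is the cost that grows once the replicas sit in different data centers.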

I use strong consistency!


Flume Event Scalability Analysis

This is the follow-up to the project in this post. Our professor asked us to perform a scalability analysis of the technology that we used and write a simple report on it. I decided to analyze Flume's scalability in terms of the number of events that a Flume configuration can support. The project is inspired by Mike Percy’s Flume-NG performance measurement, and I re-used some of his software components. There are two main differences between this project and Mike’s work:

  • This experiment introduces a one-to-one relationship between nodes and Flume load generators. That means each Flume load generator process runs on its own node (an Amazon EC2 medium instance).
  • This experiment introduces a cascading setup, to verify whether it improves scalability compared to a non-cascading setup.

Weekend with Flume – Part 2

After covering some of the basic configurations of Flume in Part 1 of the Weekend with Flume series, I’ll cover Avro Source, Avro Sink and HDFS Sink in this post. Let’s pick a scenario from our school project below. We set up this configuration in Amazon Web Services (AWS), but I will not discuss our experiences with AWS in this post.

Flume-based News Aggregator

Avro Sink Configuration

We have Agent#1 connecting to the Collector through a pair of Avro Sink and Avro Source. To achieve this, we use the following Flume configuration file in Agent#1.

Weekend with Flume – Part 1

I was in geeky mode this weekend, spending most of my time configuring Flume for our SDS project. I’ll share some observations and tricks that our group came up with while configuring Flume.

Apache Flume

Flume Primer

Quoting the definition from Apache’s Flume Wiki:

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Rise of Network Virtualization – Final

This is the follow-up to this post about my EEDC project this semester. Well, I would have liked more time to polish my slides, but that’s fine. It’s already in the past 🙂

I used a historical perspective to discuss the “rise” of network virtualization. In this context I focused on SDN (Software-Defined Networking), and it all started in 2007. At that time, networks were getting faster, but not better. They had several limitations, such as high maintenance complexity, a high likelihood of inconsistent policies across devices in the network, inability to scale, and dependency on vendors. On the other hand, the need for a new network architecture had become crucial due to changing traffic patterns (not only between client and server, but also between nodes in a server cluster), the consumerization of IT, and the rise of cloud services and Big Data.

Large-Scale Decentralized Storage Systems for Volunteer Computing Systems – Final

This is my first post in the new domain :). I would like to present the final report and presentation of the Decentralized System (DS) project, whose overview I posted here.

We continued with the survey of Decentralized Storage Systems (DSS), and we ranked them based on the following five characteristics:

  1. Availability (AV). The formal definition is the fraction of time the system is up and able to perform normally, that is, uptime over (uptime + downtime). In the DSS context, it is usually reflected in the system's resistance to churn, the level of fault tolerance it implements, and how easily and quickly we can get the file or data that we want.
  2. Scalability (SC). The ability of the system to be enlarged to accommodate its growth. In the context of DSS, this could be in terms of the number of nodes, the number of messages exchanged, or the amount of data stored in the storage system.
  3. Eventual Consistency (ECO). ECO is preferred over strict Consistency (C), because availability matters more and pure consistency is impractical and expensive to achieve in terms of the messages required between nodes.
  4. Performance (P). An example of a metric that can be used to measure performance is the response time to satisfy a search request. We analyze whether or not the system puts emphasis on it.
  5. Security (SE). Resistance to some degree of attack or breach, such as compromised data integrity or unauthorized data access by a malicious node. Due to the nature of volunteer computing, security is one of the important features for attracting volunteers and ensuring data validity.
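
As a small worked example of the formal availability definition in item 1, the fraction can be computed as below. This is just a sketch; the function name and the sample numbers are made up for illustration and do not come from our report.

```python
def availability(uptime_hours, downtime_hours):
    """Return uptime / (uptime + downtime) as a fraction between 0 and 1."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


# Hypothetical node: 10 hours of downtime in a 8760-hour year.
yearly = availability(8750, 10)
print(f"availability: {yearly:.4%}")
```

In a volunteer computing setting the downtime of a single node is large because of churn, which is why the systems we surveyed rely on replication rather than on any one node being highly available.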

We also identified the focus of each DSS that we surveyed, as shown below.

Intelligent Placement of Data Center for Internet Services

It has been a week since my last post :). Well, I was pretty occupied, dealing with deadlines and an impromptu soccer + Paris trip :p. Well, back to business now. Here are my latest slides for the EEDC assignment. They discuss how to determine data center locations, based on the paper Intelligent Placement of Datacenters for Internet Services, by I. Goiri et al. It is pretty interesting because this kind of information is usually confidential for the “juggernauts” of the internet (Google, Facebook, etc.).

The authors propose a framework to address the placement of data centers, and then formulate the framework as an optimization problem.