Wednesday, December 16, 2015

MapReduce

MapReduce is mainly used for parallel processing of large data sets. It was originally a programming model designed at Google to provide parallelism, data distribution and fault tolerance. MapReduce (MR) processes data in the form of key-value pairs. A key-value (KV) pair is a mapping between two linked data items: a key and its value.
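
To make the key-value flow concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using word counting as the job. It is plain Python rather than the Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Emit one (word, 1) pair for every word in the input text."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Run the mapper over every document, then reduce each group of values.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network. Here everything runs in one process, just to show the shape of the computation.
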

The key (K) acts as an identifier for the value. An example of a key-value (KV) pair is one where the key is a node ID and the value is that node's properties, such as its neighbor nodes and its predecessor node. The MR API provides features such as batch processing, parallel processing of huge amounts of data and high availability.
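
As a rough illustration of the node example, such a record could be modelled as below. The NodeProperties class and both helper functions are hypothetical names chosen for this sketch, not part of any Hadoop API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class NodeProperties:
    """The value part of the pair: everything known about one node."""
    neighbors: List[str] = field(default_factory=list)
    predecessor: Optional[str] = None


def make_record(node_id, neighbors, predecessor=None) -> Tuple[str, NodeProperties]:
    """Build a (key, value) pair keyed by the node ID."""
    return node_id, NodeProperties(list(neighbors), predecessor)


def emit_predecessor_candidates(record):
    """Mapper-style step: tell each neighbor that this node can reach it."""
    node_id, props = record
    return [(neighbor, node_id) for neighbor in props.neighbors]


record = make_record("A", ["B", "C"])
candidates = emit_predecessor_candidates(record)
print(record)
print(candidates)
```

The second function shows why the node ID makes a useful key: the pairs it emits are keyed by the neighbor, so a later reduce step would receive all candidate predecessors of one node together.
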
For effective scheduling of work, Hadoop provides specific features at the architecture level: fault tolerance, rack awareness and the replication factor. Compared with the block size of a native UNIX/Linux file system (8 to 16 KB), the default block size in Hadoop is much larger: 64 MB in Hadoop 1.x, and it can be changed to 128 MB, which is the default from Hadoop 2.x onwards. The replication factor is 3 by default, but it can be increased or decreased depending on the business requirement. Because HDFS blocks are much larger than disk blocks, the cost of seeks is reduced.
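
The effect of block size and replication factor can be worked out with a little arithmetic. The sketch below assumes a 1 GB file purely as an example; block_layout is a made-up helper, not a Hadoop function.

```python
import math

MB = 1024 * 1024


def block_layout(file_size_bytes, block_size_bytes=64 * MB, replication=3):
    """Return (blocks, replicas, raw_bytes) for a file split into fixed-size blocks."""
    blocks = math.ceil(file_size_bytes / block_size_bytes)
    replicas = blocks * replication
    # A partly filled last block only occupies the bytes it actually holds,
    # so raw storage is the file size times the replication factor.
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes


layout_64 = block_layout(1024 * MB)
layout_128 = block_layout(1024 * MB, block_size_bytes=128 * MB)
print(layout_64)
print(layout_128)
```

With 64 MB blocks the 1 GB file is split into 16 blocks (48 replicas in total); with 128 MB blocks it is 8 blocks (24 replicas). The raw storage is 3 GB in both cases, but the larger block size halves the number of blocks that have to be tracked and located.
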

Hadoop requires Java Runtime Environment (JRE) 1.6 or higher, because Hadoop is built on top of Java APIs. Hadoop can run in anything from a simple single-node setup to a large multi-node cluster environment.
A master/slave architecture manages two main types of functionality in HDFS: file management and I/O. The master program is called the NameNode and the slave programs are called DataNodes. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.
The NameNode executes file system namespace operations such as opening and closing files. In a cluster of machines, a dedicated machine runs the NameNode, which is the arbitrator of the DataNodes and the repository of HDFS metadata.
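
The split between metadata on the master and block storage on the slaves can be sketched with a toy model. ToyNameNode below is a hypothetical class written for this post, and its round-robin replica placement is an assumption made to keep the example short; real HDFS uses a rack-aware placement policy.

```python
from itertools import cycle


class ToyNameNode:
    """Holds metadata only: which blocks form a file and where the replicas live."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = list(data_nodes)
        # Cannot keep more replicas than there are data nodes.
        self.replication = min(replication, len(self.data_nodes))
        self.namespace = {}   # file path -> list of block ids
        self.block_map = {}   # block id -> list of data node names
        self._next_start = cycle(range(len(self.data_nodes)))

    def create(self, path, num_blocks):
        """Register a file and assign each block to distinct data nodes."""
        block_ids = []
        for index in range(num_blocks):
            block_id = f"{path}#blk{index}"
            start = next(self._next_start)
            # Round-robin placement (an assumption, not the HDFS policy).
            self.block_map[block_id] = [
                self.data_nodes[(start + offset) % len(self.data_nodes)]
                for offset in range(self.replication)
            ]
            block_ids.append(block_id)
        self.namespace[path] = block_ids

    def locate(self, path):
        """Answer a client lookup: the blocks of a file and their locations."""
        return [(b, self.block_map[b]) for b in self.namespace[path]]


name_node = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
name_node.create("/logs/app.log", num_blocks=2)
locations = name_node.locate("/logs/app.log")
print(locations)
```

Note that the toy NameNode never touches file contents. A client asks it where the blocks are and would then read or write the data directly on the DataNodes.
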
