Tuesday, 15 July 2014

Apache Hadoop

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.





ApplicationMaster - The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.

Block Storage Service -  Is a layer in the HDFS: It has two parts

  • Block Management (which is done in Namenode)

    • Provides datanode cluster membership by handling registrations, and periodic heart beats.

    • Processes block reports and maintains location of blocks.

    • Supports block related operations such as create, delete, modify and get block location.

    • Manages replica placement and replication of a block for under replicated blocks and deletes blocks that are over replicated.

  • Storage - is provided by datanodes by storing blocks on the local file system and allows read/write access.

DataNode (DN) - A DataNode stores data in the [HadoopFileSystem]. A functional filesystem has more than one DataNode, with data replicated across them.

An ideal configuration is for a server to have a DataNode, a TaskTracker, and then physical disks one TaskTracker slot per CPU. This will allow every TaskTracker 100% of a CPU, and separate disks to read and write data.

Avoid using NFS for data storage in production system.

Edit Log (EditLog) - This filesystem metadata is stored in two different constructs: the fsimage and the edit log. The fsimage file format is very efficient to read, but it’s unsuitable for making small incremental updates like renaming a single file. Thus, rather than writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation in the edit log for durability.

FsImage File - The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode’s local file system too.

Hadoop Distributed File System (HDFS) - is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant:

  • HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. 
  • HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
  • HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
  • HDFS was originally built as infrastructure for the Apache Nutch web search engine project.
  • HDFS is now an Apache Hadoop subproject. The project URL is http://hadoop.apache.org/hdfs/.

HBase - is an open source, non-relational, distributed database modeled after Google's BigTable. It is developed as part of Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop.

HDFS Federation - Is the new Hadoop HDFS

JobTracker - Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible.

Log Aggregation - The YARN NodeManager provides the option to save logs securely onto a file-system (FS), for e.g. HDFS, after an application completes.

Logs for all the containers belonging to a single Application and that ran on a given NM are aggregated and written out to a single (possibly compressed) log file at a configured location in the FS.



MapReduce - A software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

  • MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.

  • Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (HDFS) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

  • The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.

  • Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

  • Although the Hadoop framework is implemented in JavaTMMapReduce applications need not be written in Java:
    • Hadoop Streaming - is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
    • Hadoop Pipes - is a SWIG(Simplified Wrapper and Interface Generator)- compatible C++ API to implement MapReduce applications (non JNI [Java Native Inteface] based).

MapReduce NextGen (YARN aka MRv2) - The new architecture of Hadoop MapReduce. It was introduced in hadoop-0.23, and divides the two major functions of the JobTracker: 1)resource management and 2)job life-cycle management into separate components.

Name Service - A Hadoop service for Namenodes/namespaces.

NameNode - The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.

Namespace - Is a layer in the HDFS:

  • It consists of directories, files and blocks
  • It supports all the namespace related file system operations such as create, delete, modify and list files and directories.



NodeManager (NM) - form the data-computation framework for a node. See also ResourceManager.

OIV (Offline Image Viewer) - The Offline Image Viewer is a tool to dump the contents of hdfs fsimage files to human-readable formats in order to allow offline analysis and examination of an Hadoop cluster's namespace.

POSIX (Portable Operating System Interface) - A family of standards specified by the IEEE for maintaining compatibility between operating systems.

ResourceManager - The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The ResourceManager has two main components: Scheduler and ApplicationsManager:

  • The Scheduler is responsible for allocating resources to the various running applications.

  • The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application specific ApplicationMaster and provides the service for restarting the ApplicationMaster container on failure.


splits - The jobtracker divides the input data sources (e.g., input files) into fragments that make up the inputs to individual map tasks. These fragments are called "splits" and are encapsulated in instances of the InputSplit interface. When dividing the data into input splits, it is important that this process be quick and cheap. The data itself should not need to be accessed to perform this process (as it is all done by a single machine at the start of the MapReduce job).


TaskTracker - See JobTracker


ViewFs (View File System) - The View File System (ViewFs) provides a way to manage multiple Hadoop file system namespaces (or namespace volumes). It is particularly useful for clusters having multiple namenodes, and hence multiple namespaces, in HDFS Federation. ViewFs is analogous to client side mount tables in some Unix/Linux systems. ViewFs can be used to create personalized namespace views and also per-cluster common views.


WebHDFS - A HTTP REST API for accessing HDFS.





















Hadoop consists of significant improvements over the previous stable release (hadoop-1.x).

Here is a short overview of the improvments to both HDFS and MapReduce.


  • HDFS Federation - In order to scale the name service horizontally, federation uses multiple independent Namenodes/Namespaces. The Namenodes are federated, that is, the Namenodes are independent and don't require coordination with each other. The datanodes are used as common storage for blocks by all the Namenodes. Each datanode registers with all the Namenodes in the cluster. Datanodes send periodic heartbeats and block reports and handles commands from the Namenodes. More details are available in the HDFS Federation document.

  • MapReduce NextGen aka YARN aka MRv2

    • The new architecture of MapReduceintroduced in hadoop-0.23, divides the two major functions of the JobTracker: 1)resource management and 2)job life-cycle management into separate components.

    • The new ResourceManager manages the global assignment of compute resources to applications and the per-application ApplicationMaster manages the application‚ scheduling and coordination.

    • An application is either a single job in the sense of classic MapReduce jobs or a DAG (Directed Acyclic Graph) of such jobs.

    • The ResourceManager and per-machine NodeManager daemon, which manages the user processes on that machine, form the computation fabric.

    • The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.

    • More details are available in the YARN document.