The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
ApplicationMaster - The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.
Block Storage Service - Is a layer in the HDFS: It has two parts
- Block Management (which is done in Namenode)
- Provides datanode cluster membership by handling registrations, and periodic heart beats.
- Processes block reports and maintains location of blocks.
- Supports block related operations such as create, delete, modify and get block location.
- Manages replica placement and replication of a block for under replicated blocks and deletes blocks that are over replicated.
- Storage - is provided by datanodes by storing blocks on the local file system and allows read/write access.
An ideal configuration is for a server to have a DataNode, a TaskTracker, and then physical disks one TaskTracker slot per CPU. This will allow every TaskTracker 100% of a CPU, and separate disks to read and write data.
Avoid using NFS for data storage in production system.
- HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
- HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
- HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
- HDFS was originally built as infrastructure for the Apache Nutch web search engine project.
- HDFS is now an Apache Hadoop subproject. The project URL is http://hadoop.apache.org/hdfs/.
HBase - is an open source, non-relational, distributed database modeled after Google's BigTable. It is developed as part of Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop.
HDFS Federation - Is the new Hadoop HDFS
Log Aggregation - The YARN NodeManager provides the option to save logs securely onto a file-system (FS), for e.g. HDFS, after an application completes.
Logs for all the containers belonging to a single Application and that ran on a given NM are aggregated and written out to a single (possibly compressed) log file at a configured location in the FS.
MapReduce - A software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
- A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
- Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (HDFS) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.
- The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.
- Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.
- Although the Hadoop framework is implemented in JavaTM, MapReduce applications need not be written in Java:
- Hadoop Streaming - is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
- Hadoop Pipes - is a SWIG(Simplified Wrapper and Interface Generator)- compatible C++ API to implement MapReduce applications (non JNI [Java Native Inteface] based).
Name Service - A Hadoop service for Namenodes/namespaces.
NameNode - The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Namespace - Is a layer in the HDFS:
- It consists of directories, files and blocks
- It supports all the namespace related file system operations such as create, delete, modify and list files and directories.
NodeManager (NM) - form the data-computation framework for a node. See also ResourceManager.
POSIX (Portable Operating System Interface) - A family of standards specified by the IEEE for maintaining compatibility between operating systems.
ResourceManager - The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The ResourceManager has two main components: Scheduler and ApplicationsManager:
- The Scheduler is responsible for allocating resources to the various running applications.
- The ApplicationsManager is responsible for accepting job-submissions, negotiating the first container for executing the application specific ApplicationMaster and provides the service for restarting the ApplicationMaster container on failure.
ViewFs (View File System) - The View File System (ViewFs) provides a way to manage multiple Hadoop file system namespaces (or namespace volumes). It is particularly useful for clusters having multiple namenodes, and hence multiple namespaces, in HDFS Federation. ViewFs is analogous to client side mount tables in some Unix/Linux systems. ViewFs can be used to create personalized namespace views and also per-cluster common views.
WebHDFS - A HTTP REST API for accessing HDFS.
Hadoop consists of significant improvements over the previous stable release (hadoop-1.x).
Here is a short overview of the improvments to both HDFS and MapReduce.
- HDFS Federation - In order to scale the name service horizontally, federation uses multiple independent Namenodes/Namespaces. The Namenodes are federated, that is, the Namenodes are independent and don't require coordination with each other. The datanodes are used as common storage for blocks by all the Namenodes. Each datanode registers with all the Namenodes in the cluster. Datanodes send periodic heartbeats and block reports and handles commands from the Namenodes. More details are available in the HDFS Federation document.
- MapReduce NextGen aka YARN aka MRv2
- The new architecture of MapReduceintroduced in hadoop-0.23, divides the two major functions of the JobTracker: 1)resource management and 2)job life-cycle management into separate components.
- The new ResourceManager manages the global assignment of compute resources to applications and the per-application ApplicationMaster manages the application‚ scheduling and coordination.
- An application is either a single job in the sense of classic MapReduce jobs or a DAG (Directed Acyclic Graph) of such jobs.
- The ResourceManager and per-machine NodeManager daemon, which manages the user processes on that machine, form the computation fabric.
- The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
- More details are available in the YARN document.