LLMpedia
The first transparent, open encyclopedia generated by LLMs

Apache Hadoop

Generated by DeepSeek V3.2
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: Google Hop 4
Expansion Funnel: Raw 45 → Dedup 30 → NER 15 → Enqueued 15
1. Extracted: 45
2. After dedup: 30
3. After NER: 15 (rejected 15 as non-named-entities)
4. Enqueued: 15
Apache Hadoop
Name: Apache Hadoop
Developer: Apache Software Foundation
Released: 1 April 2006
Programming language: Java
Operating system: Cross-platform
Genre: Distributed computing

Apache Hadoop is an open-source software framework for distributed storage and processing of very large data sets across clusters of computers using simple programming models. The framework is designed to scale up from single servers to thousands of machines, each offering local computation and storage. It is a foundational technology in the field of big data, enabling reliable, scalable, and distributed computing.

Overview

The project was created by Doug Cutting and Mike Cafarella in 2005, with its name inspired by a toy elephant belonging to Cutting's son. It became a top-level Apache Software Foundation project in January 2008. The core premise is to detect and handle failures at the application layer, delivering a highly available service on top of a cluster in which individual machines may fail at any time. This approach revolutionized data processing by making it feasible and cost-effective to analyze petabyte-scale datasets on commodity hardware. Its development was significantly influenced by papers from Google on the Google File System and MapReduce.

Architecture

The architecture is fundamentally based on a master-slave design. The master node, known as the NameNode, manages the file system namespace and regulates access to files by clients. The slave nodes, or DataNodes, manage storage attached to the nodes they run on. For processing, a master JobTracker manages the distribution of tasks across slave TaskTrackers. This design allows the system to parallelize data processing across the cluster efficiently. The framework assumes that hardware failures are common and handles them automatically, a principle critical for deployments on large-scale, inexpensive hardware clusters.
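The master's bookkeeping role can be illustrated with a toy model: a miniature "NameNode" that splits files into fixed-size blocks and records which "DataNodes" hold each replica. This is a self-contained sketch of the idea in plain Java, not the real Hadoop API; the class and method names below are invented for illustration, and real HDFS placement is rack-aware rather than round-robin.

```java
import java.util.*;

// Toy model of HDFS-style metadata: the "NameNode" tracks, for each file,
// which "DataNodes" hold replicas of each block. Illustrative only; names
// here are invented and real placement policy is rack-aware.
public class MiniNameNode {
    private final List<String> dataNodes;
    private final int replication;
    private final Map<String, List<List<String>>> blockMap = new HashMap<>();

    public MiniNameNode(List<String> dataNodes, int replication) {
        this.dataNodes = dataNodes;
        this.replication = replication;
    }

    // Split a file of sizeMb into 128 MB blocks and assign each block
    // `replication` distinct DataNodes, round-robin for simplicity.
    public void addFile(String path, int sizeMb) {
        int blocks = (sizeMb + 127) / 128;
        List<List<String>> placements = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add(dataNodes.get((b + r) % dataNodes.size()));
            }
            placements.add(replicas);
        }
        blockMap.put(path, placements);
    }

    // Clients ask the NameNode where a file's blocks live, then read the
    // data directly from those DataNodes.
    public List<List<String>> locate(String path) {
        return blockMap.get(path);
    }

    public static void main(String[] args) {
        MiniNameNode nn = new MiniNameNode(List.of("dn1", "dn2", "dn3", "dn4"), 3);
        nn.addFile("/logs/access.log", 300); // 300 MB -> 3 blocks of 128 MB
        System.out.println(nn.locate("/logs/access.log").size()); // 3
        System.out.println(nn.locate("/logs/access.log").get(0)); // [dn1, dn2, dn3]
    }
}
```

The key design point the sketch captures is that the master holds only metadata: block data itself never flows through the NameNode, which is what lets a single master coordinate thousands of storage nodes.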

Components

The ecosystem comprises several key modules. The core includes Hadoop Distributed File System (HDFS) for storage and the MapReduce programming model for processing. YARN (Yet Another Resource Negotiator) is the framework's cluster resource management layer. Other essential related projects within the ecosystem include Apache Hive for data warehousing and SQL-like querying, Apache Pig for high-level data flow scripting, and Apache HBase for non-relational, distributed database capabilities. Libraries like Apache Avro for data serialization and Apache ZooKeeper for coordination are also integral.
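The MapReduce model itself can be sketched without a cluster. The canonical example is word count: a map phase emits (word, 1) pairs, a shuffle phase groups pairs by key, and a reduce phase sums each group. The sketch below simulates those three phases with plain Java streams; real Hadoop jobs instead extend the framework's Mapper and Reducer classes and run distributed across the cluster.

```java
import java.util.*;
import java.util.stream.*;

// Word count expressed in the three MapReduce phases: map (emit key-value
// pairs), shuffle (group by key), reduce (aggregate values per key).
// A local simulation of the model, not the Hadoop API.
public class WordCountModel {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: each input line yields one (word, 1) pair per word.
        Stream<Map.Entry<String, Integer>> mapped = lines.stream()
            .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
            .filter(w -> !w.isEmpty())
            .map(w -> Map.entry(w, 1));
        // Shuffle + reduce: group the pairs by key and sum each group.
        return mapped.collect(Collectors.groupingBy(
            Map.Entry::getKey,
            Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            wordCount(List.of("big data big clusters", "data"));
        System.out.println(counts.get("big"));  // 2
        System.out.println(counts.get("data")); // 2
    }
}
```

Because each map call depends only on its own input line and each reduce call only on one key's group, the framework can run both phases in parallel across many machines, which is the property that makes the model scale.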

Applications

It is widely used across numerous industries for large-scale data analytics. Major technology companies like Yahoo!, Facebook, and LinkedIn were among its earliest and largest adopters, using it for log processing, recommendation systems, and data mining. In finance, institutions apply it for risk modeling and fraud detection. Within scientific research, organizations such as the National Institutes of Health and CERN use it for genomic sequencing and analyzing data from particle colliders. Its ability to process unstructured data like social media feeds from Twitter has also been pivotal.

History

The project originated from the Nutch web search engine project. Cutting and Cafarella, while working on Nutch, encountered scalability issues and were inspired by the 2004 papers from Google on GFS and MapReduce. An initial implementation was part of Nutch, but it was spun off into a separate Apache Software Foundation subproject in 2006. Version 1.0.0 was released in December 2011. A major architectural shift occurred with the introduction of YARN in the 2.x release series, separating resource management from the programming model. This evolution allowed support for processing paradigms beyond MapReduce, such as Apache Spark.

The ecosystem has grown to include a vast array of complementary projects under the Apache Software Foundation. Apache Spark is a prominent in-memory data processing engine that often runs on Hadoop clusters. Apache Flink provides capabilities for stream processing. For workflow scheduling there is Apache Oozie. Apache Sqoop is used for transferring data between Hadoop and relational databases, while Apache Flume aggregates log data. Other significant projects include Apache Mahout for machine learning and Apache Impala for low-latency SQL querying. Many of these are part of commercial distributions from vendors such as Cloudera, Hortonworks, and MapR Technologies.

Categories: Apache Software Foundation projects · Big data · Distributed computing architecture · Free software programmed in Java · 2006 software