Basics of Hadoop
This post dives into the basics of Hadoop, the problems it solves, and some of the technologies built on top of it.
What is Hadoop?
As per Wikipedia, “Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.”
In terms of storage, Hadoop manages data up to terabytes or even petabytes (1 petabyte = 1000 terabytes). Hadoop is designed to scale from a single machine to clusters of thousands of machines, each offering local computation and storage.
The framework is developed and maintained as a project of the Apache Software Foundation, hence the name Apache Hadoop.
Please note that “Hadoop” has become a fairly generic term these days. Hadoop is a big open-source framework that serves as a baseline, and various technologies are built on top of it or used alongside it, such as Hive, HBase, Spark, ZooKeeper, and Storm.
4 Key Hadoop Modules
Four key modules make up the core Hadoop framework and work together to form the Hadoop ecosystem:
Hadoop Distributed File System: HDFS is the primary storage system used by Hadoop applications. It is a distributed file system in which each individual node operates on its own local storage. HDFS is designed more for batch processing than for interactive use, with the emphasis on high throughput of data access rather than low latency.
Yet Another Resource Negotiator: YARN is a resource-management platform responsible for managing compute resources (for example CPU, RAM, and GPU) across the cluster and using them to efficiently schedule users’ applications across the Hadoop ecosystem.
MapReduce: This is a programming model for processing data at large scale. In MapReduce, small subsets of a large dataset, along with instructions for processing them, are sent to multiple machines in the Hadoop cluster. Each node processes its subset in parallel with the other nodes, and the results are then combined into a smaller, more manageable dataset.
Hadoop Common: Hadoop Common includes the libraries and utilities used and shared by the other Hadoop modules. Think of it as a common set of code and libraries that HDFS, YARN, and MapReduce can all use.
While the core Hadoop framework is written in Java, you can write MapReduce programs in other languages such as C++ or Python; mechanisms like Hadoop Streaming and Hadoop Pipes bridge between your code and the Java framework.
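To make the MapReduce model concrete, here is the classic word-count job in Java, closely following the WordCount example from the Apache Hadoop tutorial. The mapper emits a (word, 1) pair for every word in its input split, the reducer sums the counts for each word, and the input and output HDFS paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

You would typically package this into a jar and submit it with something like hadoop jar wordcount.jar WordCount /input /output; YARN then schedules the map and reduce tasks across the cluster.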
How does Hadoop Work?
Hadoop distributes datasets across a cluster of commodity hardware, i.e., inexpensive, off-the-shelf machines that can easily be replaced with any other hardware regardless of origin or manufacturer. Processing is performed in parallel on multiple servers simultaneously.
Software clients feed data into Hadoop. HDFS handles the metadata and the distributed storage of that data, YARN divides the jobs across the computing cluster and allocates resources to them, and MapReduce then processes and transforms the data.
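As a rough sketch of what such a client looks like, the snippet below uses Hadoop’s Java FileSystem API to write a small file into HDFS and read it back. The NameNode address (hdfs://namenode-host:8020) and the file path are placeholders; in a real deployment they would come from the cluster’s configuration files.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
  public static void main(String[] args) throws Exception {
    // Point the client at the NameNode; host and port are placeholders.
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

    FileSystem fs = FileSystem.get(conf);

    // Write a small file; HDFS splits it into blocks and replicates them across DataNodes.
    Path file = new Path("/user/demo/hello.txt");
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the same file back through the same API.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}
```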
All Hadoop modules are designed with a fundamental assumption that hardware failures of individual machines or racks of machines are common and should be automatically handled in software by the framework.
HDFS Architecture
(Figure: HDFS architecture diagram. “Block ops” refers to operations performed on blocks, a block being the single unit of storage in HDFS.)
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files.
Let’s try to understand each component of the HDFS architecture:
1. NameNode (Master):
The NameNode acts as the central authority (like the brain), managing the file system namespace. This namespace directory structure keeps track of all files and folders within the HDFS. It regulates access to files by clients, ensuring data security.
It doesn’t store the actual data itself; it just keeps track of where data resides on the DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories.
2. DataNodes (Slaves):
The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode. Also, the DataNodes report back to the NameNode periodically, keeping it updated on block health.
3. Blocks:
Files in HDFS are split into fixed-size blocks (128 MB by default, configurable) for efficient storage and distribution. Blocks are the fundamental unit of storage in HDFS.
4. Replication:
To ensure fault tolerance and data availability, HDFS replicates each block across multiple DataNodes. The number of replicas (the replication factor) is configurable (see the sketch after this list). If a DataNode fails, the data remains accessible from the other replicas; that is why HDFS is considered fault-tolerant.
The NameNode makes all decisions regarding the replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
5. Secondary NameNode (Optional):
The Secondary NameNode periodically (for example, hourly) checkpoints the file system metadata: it merges the NameNode’s edit log into the fsimage file and stores the result. If the NameNode fails or crashes, this checkpointed fsimage can be copied to a new machine to restore the namespace.
FsImage is a file stored on the NameNode’s local OS filesystem that contains the complete directory structure (namespace) of HDFS together with the mapping of files to blocks. The mapping of blocks to DataNodes is not persisted in the fsimage; the NameNode rebuilds it from the block reports the DataNodes send.
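Tying items 3 and 4 together: block size and replication factor are ordinary configuration settings. The sketch below shows the standard HDFS property names (dfs.blocksize, dfs.replication) being set programmatically, plus the per-file FileSystem.setReplication() call; in practice the cluster-wide defaults usually live in hdfs-site.xml, and the NameNode address and file path here are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder NameNode address

    // Standard HDFS property names; cluster-wide defaults normally live in hdfs-site.xml.
    conf.set("dfs.blocksize", "134217728"); // 128 MB blocks
    conf.set("dfs.replication", "3");       // keep 3 copies of every block

    FileSystem fs = FileSystem.get(conf);

    // Replication can also be changed per file after it has been written.
    Path existing = new Path("/user/demo/hello.txt"); // hypothetical existing file
    fs.setReplication(existing, (short) 2);

    // Inspect what HDFS actually recorded for that file.
    System.out.println("Replication: " + fs.getFileStatus(existing).getReplication());
    System.out.println("Block size:  " + fs.getFileStatus(existing).getBlockSize());

    fs.close();
  }
}
```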
A typical file in HDFS is gigabytes to terabytes in size and thus HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. “HDFS provides interfaces for applications to move themselves closer to where the data is located”.
Hadoop Tools
Many open-source tools extend the capabilities of the core Hadoop modules:
Apache Hive: Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at massive scale and facilitates reading, writing, and managing petabytes of data residing in HDFS using HiveQL (Hive Query Language), which is similar to SQL (see the query sketch after this list).
Apache HBase: Apache HBase is the Hadoop database, a distributed, scalable, big data store. This is a NoSQL database that sits on top of HDFS. It's built for fast read/write access to large datasets, especially when you need to access specific records quickly. HBase uses a column-oriented format for data storage, allowing for efficient retrieval of individual records.
While HDFS is a distributed file system designed for storing large files across clusters of machines, HBase is designed to quickly find specific records within those files (see the client sketch after this list).
Apache Pig: A tool used as an abstraction layer over MapReduce to analyze large sets of data, providing operations like filter, sort, load, and join.
Apache Zookeeper: An open-source server that enables reliable distributed coordination in Hadoop; a service for “maintaining configuration information, naming, providing distributed synchronization, and providing group services”.
Apache Oozie: This is a workflow management system that helps schedule and coordinate Hadoop jobs. It allows you to define dependencies between jobs and automate the execution process.
Presto (not part of the Hadoop umbrella): Acts as a distributed SQL query engine. It can sit on top of Hadoop, allowing users to run interactive SQL queries against data stored in HDFS. Presto doesn’t manage its own storage system and relies on connectors to access data from various sources, including HDFS.
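To give a feel for HiveQL, here is a minimal sketch that queries Hive from Java over JDBC. It assumes a HiveServer2 instance at hiveserver-host:10000, a hypothetical web_logs table, and the Hive JDBC driver on the classpath; the username and password are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC URL; host, port, and database are placeholders.
    String url = "jdbc:hive2://hiveserver-host:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "hive", ""); // placeholder credentials
         Statement stmt = conn.createStatement()) {

      // HiveQL looks like SQL; Hive compiles it into jobs over data stored in HDFS.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) { // hypothetical table
        while (rs.next()) {
          System.out.println(rs.getString("page") + " -> " + rs.getLong("hits"));
        }
      }
    }
  }
}
```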
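And to illustrate the record-level access HBase adds on top of HDFS, here is a minimal sketch using the standard HBase Java client. The ZooKeeper quorum address and the pre-created users table with an info column family are assumptions made for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
  public static void main(String[] args) throws Exception {
    // HBaseConfiguration picks up hbase-site.xml from the classpath;
    // the ZooKeeper quorum below is a placeholder.
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "zk-host");

    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) { // hypothetical table

      // Write one cell: row key "user1", column family "info", qualifier "name".
      Put put = new Put(Bytes.toBytes("user1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
      table.put(put);

      // Read that single record back by row key, the kind of fast, targeted
      // lookup HBase provides on top of HDFS.
      Result result = table.get(new Get(Bytes.toBytes("user1")));
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println(Bytes.toString(name));
    }
  }
}
```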
Benefits of using Hadoop
Scalability: Hadoop can easily scale out by adding more nodes to a cluster, allowing it to handle ever-growing data volumes.
Cost-Effectiveness: Hadoop leverages commodity hardware, making it a relatively inexpensive solution for big data storage and processing.
Fault Tolerance: Hadoop provides fault tolerance and system resilience: if one of the hardware nodes fails, jobs are redirected to other nodes. Data stored on one node of a Hadoop cluster is replicated across other nodes within the cluster to guard against hardware or software failure.
Flexibility: Hadoop allows for flexibility in data storage: data does not need to be preprocessed before it is stored, which means an organization can store as much data as it likes and decide how to use it later.
That’s it, folks, for this edition of the newsletter. Please consider liking and sharing it with your friends, as that motivates me to bring you good content for free. If you think I am doing a decent job, share this article with a short summary with your network. Connect with me on LinkedIn or Twitter for more technical posts in the future!
Book exclusive 1:1 with me here.
Resources
HDFS Architecture by Apache
What is Hadoop by Google Cloud
Apache Hadoop by Wikipedia
Hadoop Distributed File System by Databricks