What is Hadoop?
It is a word which is used very often nowadays in the IT as well as some other industry sectors. It is often used in connection with Big Data. But very few know how it is related to Big Data.
Big data is the voluminous heterogeneous data in terabytes and Hadoop is a framework (not a single technology or a product) to process efficiently Big Data. Hadoop is a part of an Ecosystem to handle big data which comprises of various other supporting technologies and products.
Hadoop is an Apache open source framework written in java which has the potential to manage thousands of terabytes of data. It has quickly developed as basis for big data processing tasks like statistical analytics, business planning and processing huge volumes of data from sensors including IoT sensors. Hadoop is designed to scale up from single server to thousands of machines, each offering local computation and storage.
The main four modules of Hadoop Framework
- Hadoop Distributed File System (HDFS) – A distributed file system which is capable of storing data across thousands of servers and provides access to application data.
- Hadoop Yet Another Resource Negotiator (YARN) – A framework for resource management and schedule jobs across the clusters holding data.
- Hadoop Map Reduce- It is a parallel processing system which is YARN based and maps the data and reduces to a result.
- Hadoop Common- They are the Java libraries and utilities supporting other modules.
The other softwares which are part of the Hadoop Ecosystem
- Apache Flume– It is meant for Data Movement. It streams logs into Hadoop. It is a reliable & distributed service to efficiently collect, aggregate and move large amounts of streaming data into HDFS
- Apache Hive-It is a Data Warehouse Infrastructure on top of Hadoop for providing summaries, query and analysis. It allows SQL programmer to SQL type of statements in HQL
- Apache Pig-It is a high level platform for creating program on Hadoop to analyze large data sets.
- Apache Hbase– It is an open source, non-relational, column oriented database management system which runs on top of HDFS. It is written in Java and modeled after Google’s Big Table.
- Apache Spark-It is a powerful open source which uses its standalone cluster mode and can run not only on Hadoop Yarn but also on Apache Mesos and Cloud.
- Apache Zookeeper, Apache Oozie, Apache Sqoop, Cloudera Impala and few others.
Advantages of Hadoop
- It is scalable
- Hadoop framework allows the user to quickly write and test distributed systems and utilizes the underlying parallelism of the CPU cores.
- Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather Hadoop library itself has been designed to detect and handle failures at the application layer.