In the previous articles of this blog, we have seen the need for and importance of big data and its applications in the IT industry. But big data comes with its own problems, and to overcome them we need a framework like Hadoop to process the data. This article gives you detailed information on the problems of big data and how this framework solves them. Let us discuss them one by one in detail.
Importance of Big data:
Big data is emerging as an opportunity for many organizations. Through big data, analysts today can uncover hidden insights in data, unknown correlations, market trends, customer preferences, and other useful business information. Moreover, big data analytics helps organizations with effective marketing, new revenue opportunities, and better customer service. But even though big data offers excellent opportunities, it also brings some problems. Let us have a look at them.
Problems with Big data:
The main issue with big data is heterogeneous data: the data gets generated in multiple formats from multiple sources, i.e. structured, semi-structured, and unstructured. An RDBMS mainly focuses on structured data like banking transactions, operational data, and so on. Since we cannot expect all data to be in a structured format, we need a tool to process unstructured data as well. There are any number of problems with big data; let us discuss some of them.
a) Data gets generated in huge volumes:
Storing this huge data in traditional databases is not practically possible. Moreover, a traditional database is limited to a single system, while the data keeps increasing at a tremendous rate.
b) Data gets generated in heterogeneous formats:
Data is generated not only in huge amounts but also in multiple formats: structured, semi-structured, and unstructured. So you need to make sure that you have a system capable of storing all the varieties of data generated from various sources.
c) Slow data access speed:
This is a major drawback of traditional databases: the access speed does not grow in proportion to the disk storage. So as the data increases, the access rate does not keep up with it. Moreover, since all formats of data sit in a single place, the access rate effectively decreases as the data grows.
Hadoop came into existence to process unstructured data like text, audio, video, etc. But before getting to know this framework, let us first have a look at its evolution.
The Hadoop framework has evolved through various stages over the years, as follows:
a) 2003 – Doug Cutting launches a project named Nutch to handle billions of searches and index millions of web pages. Later that year, Google publishes a white paper on the Google File System (GFS).
b) 2004 – In December, Google releases the white paper on MapReduce.
c) 2005 – Nutch uses GFS and MapReduce to perform its operations.
d) 2006 – Yahoo creates Hadoop, based on GFS and MapReduce, with Doug Cutting and his team.
e) 2007 – Yahoo starts running Hadoop on a 1000-node cluster.
f) 2008 – Yahoo releases Hadoop as an open-source project to the Apache Software Foundation. Later, in July 2008, Apache successfully tests a 4000-node cluster with Hadoop.
g) 2009 – Hadoop successfully processes a petabyte of data in less than 17 hours to handle billions of searches and index millions of web pages.
Since then, various versions have been released to handle billions of web pages.
So far we have discussed the evolution; now let us move on to the actual concept.
What is Hadoop?
Hadoop is a framework for storing big data and processing it in parallel in a distributed environment. It is capable of storing data and running applications on clusters of commodity hardware. The framework is written in Java and works in a batch-processing fashion. Besides this, it provides massive storage for any kind of data along with enormous computing power, and it can handle a virtually limitless number of concurrent tasks (or jobs). It can efficiently store and process datasets ranging from gigabytes to petabytes. Instead of using one large computer to store and process the data, Hadoop lets you cluster multiple computers to analyze massive data sets in parallel, and therefore more quickly.
Here the data is stored on inexpensive commodity servers that run as a cluster. Its distributed file system enables concurrent processing and fault tolerance. The framework uses the MapReduce programming model for fast data storage and retrieval from its nodes. Today many applications generate big data that needs to be processed, and Hadoop plays a significant role in providing a much-needed makeover to the database world.
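To make the split-and-combine idea concrete, here is a minimal, illustrative Python sketch (not Hadoop code; the function names and thread-based "nodes" are assumptions made up for illustration) that partitions a data set across workers and combines the partial results, the way Hadoop spreads work across cluster machines:

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(numbers):
    """Work done on one 'node' of the toy cluster: a partial sum."""
    return sum(numbers)

def cluster_sum(data, nodes=4):
    """Split the data set across nodes, process in parallel, combine results."""
    size = -(-len(data) // nodes)  # ceiling division: items per partition
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        return sum(pool.map(process_partition, partitions))

print(cluster_sum(list(range(1, 101))))  # prints 5050
```

Each partition is processed independently, so adding more workers (in Hadoop's case, more machines) shortens the overall processing time.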
Get more information on big data by live experts at Hadoop Online Training
This framework has the following main components:
a) HDFS:
HDFS stands for Hadoop Distributed File System. It allows you to store data of various formats across the cluster and creates an abstraction: like virtualization, you can see HDFS as a single unit for storing big data. HDFS uses a master-slave architecture, where the Name Node is the master node and the Data Nodes are the slave nodes. The Name Node contains the metadata about the data stored in the Data Nodes, such as which data block is stored in which Data Node, while the actual data lives on the Data Nodes. Moreover, HDFS has a default replication factor of 3, so even though it runs on commodity hardware, if one of the Data Nodes fails, HDFS will still have copies of the lost data blocks. You can also configure the replication factor based on your requirements.
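The block-and-replica bookkeeping described above can be imitated in a few lines of Python. This is only a toy model: the class names, the tiny block size, and the round-robin placement are illustrative assumptions, not how HDFS actually sizes blocks or places replicas:

```python
import itertools

BLOCK_SIZE = 8          # bytes per block here; the real HDFS default is much larger
REPLICATION_FACTOR = 3  # the HDFS default mentioned above

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split file contents into fixed-size blocks, as HDFS does on write."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

class NameNode:
    """Toy master node: holds only metadata (which block lives on which Data Nodes)."""
    def __init__(self, datanodes):
        self.block_map = {}               # block id -> list of Data Node names
        self._rotation = itertools.cycle(datanodes)

    def place_block(self, block_id):
        """Assign REPLICATION_FACTOR distinct Data Nodes to a block."""
        nodes = []
        while len(nodes) < REPLICATION_FACTOR:
            candidate = next(self._rotation)
            if candidate not in nodes:
                nodes.append(candidate)
        self.block_map[block_id] = nodes
        return nodes

    def locations_after_failure(self, block_id, failed_node):
        """Replicas that survive when one Data Node goes down."""
        return [n for n in self.block_map[block_id] if n != failed_node]
```

Because every block has three replicas, losing any single Data Node still leaves two copies of each of its blocks, which is the fault tolerance the article describes.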
b) YARN:
YARN stands for Yet Another Resource Negotiator. It is Hadoop's resource management layer, acts as an operating system for Hadoop, and sits on top of HDFS. It is responsible for managing the cluster resources to make sure you don't overload any one machine, and it performs all the processing activities by allocating resources and scheduling tasks. It has two major components: the Resource Manager and the Node Manager. The Resource Manager is again a master node: it receives the processing requests and passes the parts of each request to the corresponding Node Managers. The Node Managers are installed on every Data Node; each node has its own Node Manager, which manages that node, monitors its resource usage, and is responsible for executing tasks on that Data Node, where the actual processing of the data takes place. Tasks run inside containers, which hold a collection of physical resources such as RAM, CPU, or hard drive capacity.
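As a rough illustration of the Resource Manager / Node Manager split, here is a toy Python scheduler. It is a sketch under simplifying assumptions: memory is treated as the only container resource, and the "pick the least-loaded node" policy is made up for illustration, not YARN's actual scheduling algorithm:

```python
class NodeManager:
    """Toy per-node agent: tracks how much memory its containers are using."""
    def __init__(self, name, total_mem_gb):
        self.name = name
        self.total_mem_gb = total_mem_gb
        self.used_mem_gb = 0

    def can_fit(self, mem_gb):
        return self.used_mem_gb + mem_gb <= self.total_mem_gb

    def launch_container(self, mem_gb):
        self.used_mem_gb += mem_gb

class ResourceManager:
    """Toy master: places each container on the least-loaded node that can fit it,
    so no single machine gets overloaded."""
    def __init__(self, node_managers):
        self.nodes = node_managers

    def allocate(self, mem_gb):
        candidates = [n for n in self.nodes if n.can_fit(mem_gb)]
        if not candidates:
            return None  # no capacity: the request has to wait
        node = min(candidates, key=lambda n: n.used_mem_gb)
        node.launch_container(mem_gb)
        return node.name
```

Requests that no node can satisfy return `None` here; real YARN instead queues them until resources free up.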
c) MapReduce:
MapReduce is a framework that helps Java programs do parallel computation on data using key-value pairs. The input is divided into small groups of data called data chunks. The Map takes the input data and converts it into a dataset that can be computed over as key-value pairs. The output of the Map is consumed by the Reducer, which produces the desired result. So in the MapReduce approach, the processing is done at the slave nodes and the final result is sent to the master node. Moreover, instead of moving the data, the code that processes the data is sent to the data; this code is small compared to the actual data, typically only kilobytes in size.
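The map-shuffle-reduce flow above can be sketched with the classic word-count example in plain Python. This simulates the phases in a single process; in real MapReduce the mappers and reducers run on different slave nodes, and the framework performs the shuffle between them:

```python
from collections import defaultdict

def map_phase(chunk):
    """Mapper: emit a (word, 1) key-value pair for every word in one data chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word to produce the final result."""
    return {word: sum(counts) for word, counts in groups.items()}

# Two data chunks, as if the input file had been split across two mappers.
chunks = [["big data needs hadoop"], ["hadoop stores big data"]]
pairs = [kv for chunk in chunks for kv in map_phase(chunk)]
result = reduce_phase(shuffle(pairs))
```

Here `result` maps each word to its total count across all chunks; because each chunk is mapped independently, the map phase parallelizes naturally across nodes.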
Likewise, each component of this framework has its own function in processing big data. You can see the practical working of this framework, with live experts and practical use cases, at Hadoop Online Course
By reaching the end of this blog, I hope you have got an overview of Hadoop and its application in the IT industry. In the upcoming posts of this blog, I'll be sharing with you the details of the Hadoop architecture and its working. Meanwhile, have a look at our Hadoop Interview Questions and get placed in a reputed firm