Posts

Showing posts from May, 2018

Hadoop MapReduce Interview Questions

Definition: MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment.

MapReduce 2.0 or YARN Architecture: The MapReduce framework also follows a Master/Slave topology, where the master node (ResourceManager) manages and tracks the various MapReduce jobs being executed on the slave nodes (NodeManagers). The ResourceManager consists of two main components:

- Application Manager: accepts job submissions, negotiates the container for the ApplicationMaster, and handles failures while executing MapReduce jobs.
- Scheduler: allocates the resources required by the various MapReduce applications running on the Hadoop cluster.

How a MapReduce job works: As the name MapReduce suggests, the reducer phase takes place after the mapper phase has completed. So, the first step is the map job, where a block of data is read and processed to produce key-v...
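To make the map-then-reduce flow concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce Java API; the class names and the choice of word count as the example are illustrative, not taken from the post:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: each call reads one line of a block and emits (word, 1) pairs.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // intermediate key-value pair
            }
        }
    }

    // Reduce phase: runs after the mappers finish; sums the counts per word.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

The framework groups all intermediate pairs by key between the two phases, so each reduce() call sees every count emitted for one word.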

What is the difference between traditional RDBMS and Hadoop?

Data Types: RDBMS relies on structured data, and the schema of the data is always known. Hadoop can store any kind of data, be it structured, unstructured, or semi-structured.

Processing: RDBMS provides limited or no processing capabilities. Hadoop allows us to process data distributed across the cluster in a parallel fashion.

Schema on Read vs. Write: RDBMS is based on 'schema on write', where schema validation is done before loading the data. On the contrary, Hadoop follows a 'schema on read' policy.

Read/Write Speed: In RDBMS, reads are fast because the schema of the data is already known. In HDFS, writes are fast because no schema validation happens during an HDFS write.

Cost: RDBMS is licensed software, so you have to pay for it. Hadoop is an open-source framework, so you don't need to pay for the software.

Best Fit Use Case: RDBMS is used for OLTP (Online Transactional Processing) systems. Hadoop is used for data discovery, dat...
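A small sketch of what 'schema on read' means in practice: the raw text below was written to HDFS with no validation, and the record structure is imposed only when a mapper reads it. The field layout (user_id, country) and class name are assumptions for illustration:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Schema on read: the (user_id, country) structure is applied here,
    // at read time, not when the data was written to HDFS.
    public class SchemaOnReadMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length < 2) {
                return; // malformed record: skipped at read time, never rejected at write time
            }
            context.write(new Text(fields[1].trim()), NullWritable.get());
        }
    }

An RDBMS would have rejected the malformed row at insert time; HDFS happily stored it, and the reader decides what to do with it.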

Daemon Services of Hadoop

Daemon services of Hadoop (v1):
1. NameNode
2. Secondary NameNode
3. JobTracker
4. DataNodes
5. TaskTrackers

The first three services (1, 2, 3) can talk to each other, and the other two (4, 5) can also talk to each other. NameNode and DataNodes talk to each other as well, as do JobTracker and TaskTrackers.

Above the file system comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware file system, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack, as sketched below. This reduces traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns...
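A simplified sketch of that locality preference (node-local first, then rack-local, then off-rack). This is illustrative logic with made-up names, not the actual JobTracker implementation:

    import java.util.Comparator;
    import java.util.List;

    // Data-locality-aware placement: prefer the node holding the block,
    // then any node on the same rack, and only then another rack.
    public class LocalityScheduler {

        enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_RACK }

        static Locality classify(String trackerHost, String trackerRack,
                                 String dataHost, String dataRack) {
            if (trackerHost.equals(dataHost)) {
                return Locality.NODE_LOCAL; // task runs where the block lives
            }
            if (trackerRack.equals(dataRack)) {
                return Locality.RACK_LOCAL; // same rack: cheap intra-rack traffic
            }
            return Locality.OFF_RACK;       // crosses the backbone network
        }

        // Pick the TaskTracker with the best locality for a given block.
        static String pickTracker(List<String[]> trackers, // {host, rack} pairs
                                  String dataHost, String dataRack) {
            return trackers.stream()
                    .min(Comparator.comparing(
                            (String[] t) -> classify(t[0], t[1], dataHost, dataRack)))
                    .map(t -> t[0])
                    .orElseThrow(() -> new IllegalStateException("no free TaskTracker"));
        }
    }

Because the enum constants are ordered best-to-worst, taking the minimum picks the closest free tracker, which is exactly the "same node, else same rack" priority described above.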

JobTracker and TaskTracker

JobTracker and TaskTracker are two essential processes involved in MapReduce execution in MRv1 (Hadoop version 1). Both processes are now deprecated in MRv2 (Hadoop version 2) and replaced by the ResourceManager, ApplicationMaster, and NodeManager daemons.

JobTracker:
- The JobTracker process runs on a separate node, not usually on a DataNode.
- JobTracker is an essential daemon for MapReduce execution in MRv1. It is replaced by the ResourceManager/ApplicationMaster in MRv2.
- JobTracker receives requests for MapReduce execution from the client.
- JobTracker talks to the NameNode to determine the location of the data.
- JobTracker finds the best TaskTracker nodes to execute tasks, based on data locality (proximity of the data) and the slots available to execute a task on a given node.
- JobTracker monitors the individual TaskTrackers and submits the overall status of the job back to the client.
- The JobTracker process is critical to the Hadoop cluster in terms of Map...
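From the client's side, the submission that reaches the JobTracker (MRv1) or the ResourceManager/ApplicationMaster (MRv2) goes through the same Java API. Here is a minimal driver sketch, reusing the illustrative WordCount classes from the earlier example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Minimal driver: packages the job and submits it to the cluster.
    // In MRv1 the request goes to the JobTracker; in MRv2 (YARN) it goes to
    // the ResourceManager, which launches an ApplicationMaster for the job.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // waitForCompletion polls job status back to the client,
            // mirroring the "submits overall status back to the client" step above.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }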

Apache Hadoop HDFS Architecture

Introduction: In this blog, I am going to talk about the Apache Hadoop HDFS architecture. You already know that HDFS is a distributed file system deployed on low-cost commodity hardware. So, it's high time we took a deep dive into the Apache Hadoop HDFS architecture and unlocked its beauty.

HDFS Architecture: Apache HDFS, or the Hadoop Distributed File System, is a block-structured file system where each file is divided into blocks of a predetermined size. These blocks are stored across a cluster of one or several machines. The Apache Hadoop HDFS architecture follows a Master/Slave architecture, where a cluster comprises a single NameNode (master node) and all the other nodes are DataNodes (slave nodes). HDFS can be deployed on a broad spectrum of machines that support Java. Though one can run several DataNodes on a single machine, in the practical world these Da...
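To see that block structure from a client's point of view, the HDFS Java API can report where each block of a file lives; the file path below is an assumption for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Lists the blocks of an HDFS file and the DataNodes holding each one.
    // The NameNode (master) answers this metadata query; the DataNodes
    // (slaves) are the machines that actually store the blocks.
    public class ListBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/sample.txt"); // illustrative path
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }

A file larger than the configured block size will show several entries, each typically replicated on multiple hosts, which is the block-structured, distributed layout the post describes.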