Red Hat Most Frequently Asked Latest Hadoop Interview Questions Answers

What Is An IdentityMapper And An IdentityReducer In MapReduce?

org.apache.hadoop.mapred.lib.IdentityMapper implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the mapper class using JobConf.setMapperClass, IdentityMapper.class is used as the default value.
org.apache.hadoop.mapred.lib.IdentityReducer performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the reducer class using JobConf.setReducerClass, IdentityReducer.class is used as the default value.
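
To make the "identity" behaviour concrete, here is a minimal Python sketch that imitates it outside Hadoop. The function names identity_map and identity_reduce and the tiny in-memory shuffle are illustrative assumptions, not Hadoop's API; the real classes are Java and run inside the framework.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")


def identity_map(key: K, value: V) -> Iterator[Tuple[K, V]]:
    """Emit the input record unchanged (hypothetical stand-in for IdentityMapper)."""
    yield key, value


def identity_reduce(key: K, values: Iterable[V]) -> Iterator[Tuple[K, V]]:
    """Emit every value for the key unchanged; nothing is aggregated."""
    for value in values:
        yield key, value


records = [("b", 2), ("a", 1), ("b", 3)]

# Map phase: each record passes through untouched.
mapped = [pair for k, v in records for pair in identity_map(k, v)]

# Shuffle: group values by key, visiting keys in sorted order.
grouped = {}
for k, v in sorted(mapped, key=lambda kv: kv[0]):
    grouped.setdefault(k, []).append(v)

# Reduce phase: every value is written out as-is.
output = [pair for k, vs in grouped.items() for pair in identity_reduce(k, vs)]
print(output)
```

With both identities in place, the only visible effect of the job is that records come out grouped and ordered by key.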

When Are The Reducers Started In A MapReduce Job?

In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available, but the programmer-defined reduce method is called only after all the mappers have finished.

If Reducers Do Not Start Before All Mappers Finish, Why Does The Progress Of A MapReduce Job Show Something Like Map(50%) Reduce(10%)? Why Is Reducer Progress Displayed When The Mappers Have Not Finished Yet?

Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes into account the data transfer done by the reduce process, so reduce progress starts showing up as soon as any intermediate key-value pair from a mapper is available to be transferred to a reducer. Although the reducer progress is updated, the programmer-defined reduce method is still called only after all the mappers have finished.
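
The arithmetic behind a figure like Reduce(10%) can be sketched as follows. This assumes, as Hadoop's reduce-side progress is commonly described, that a reduce task's progress is the average of three equally weighted phases (copy, sort, reduce); the function name and the sample fractions are hypothetical.

```python
def reduce_task_progress(copy_done: float, sort_done: float, reduce_done: float) -> float:
    """Average three phase fractions (each in [0, 1]) into one progress value.

    Assumption: the copy, sort and reduce phases each count for one third.
    """
    for fraction in (copy_done, sort_done, reduce_done):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("phase fractions must be between 0 and 1")
    return (copy_done + sort_done + reduce_done) / 3.0


# The mappers are still running: the reducer has copied 30% of the map
# output it will eventually need, and has not started sorting or reducing.
progress = reduce_task_progress(copy_done=0.3, sort_done=0.0, reduce_done=0.0)
print(f"reduce {progress:.0%}")
```

Under this model the reduce percentage can climb to about a third purely from copying, before the reduce method has been called once.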

If DataNodes Increase, Do We Need To Upgrade The NameNode?

When a Hadoop cluster is installed, the NameNode hardware is sized according to the size of the cluster. Most of the time, we do not need to upgrade the NameNode because it does not store the actual data, only the metadata, so such a requirement rarely arises.

Are The JobTracker And TaskTrackers Present On Separate Machines?

Yes, the JobTracker and the TaskTrackers run on different machines. The reason is that the JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted.

When We Send Data To A Node, Do We Allow Settling Time Before Sending More Data To That Node?

Yes, we do.

Does Hadoop Always Require Digital Data To Process?

Yes. Hadoop always requires digital data to process.

On What Basis Will The NameNode Decide Which DataNode To Write On?

Because the NameNode holds the metadata (information) about all the DataNodes, it knows which DataNodes have free space and chooses the targets for a write accordingly.

Doesn't Google Have Its Very Own Version Of Dfs?

Yes, Google owns a DFS known as “Google File System (GFS)” developed by Google Inc. for its own use.

What Is The Difference Between Gen 1 And Gen 2 Hadoop With Regard To The NameNode?

In Gen 1 Hadoop, the NameNode is a single point of failure. Gen 2 Hadoop supports an active/passive NameNode arrangement: if the active NameNode fails, the passive NameNode takes over.

Can You Explain How 'Map' And 'Reduce' Work?

The job's input is divided into splits, and each split is assigned to a map task that runs on one of the cluster's nodes, preferably one that already stores the data. Each map task processes its split, produces key-value pairs, and passes this intermediate output to the reducers. Each reducer collects the key-value pairs for its keys from all the map tasks, combines them, and generates the final output.
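
The flow described above can be imitated in a few lines of plain Python. This is a single-process word-count sketch, not Hadoop code: the helper names (map_words, reduce_counts, run_job) and the sample input are made up for illustration, and the "splits" are just lists of lines.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: combine all counts collected for one word."""
    return word, sum(counts)


def run_job(splits):
    # Map phase: one "map task" per split, each producing key-value pairs.
    intermediate = []
    for split in splits:
        for line in split:
            intermediate.extend(map_words(line))

    # Shuffle: bring together all values that share a key.
    grouped = defaultdict(list)
    for word, count in intermediate:
        grouped[word].append(count)

    # Reduce phase: one call per key, in sorted key order.
    return dict(reduce_counts(w, c) for w, c in sorted(grouped.items()))


splits = [["the quick brown fox"], ["the lazy dog", "the end"]]
result = run_job(splits)
print(result)
```

In a real cluster the map tasks run in parallel on different nodes and the shuffle moves data over the network; the sketch keeps only the data flow.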

What Is A 'Key-Value Pair' In HDFS?

A key-value pair is the intermediate data generated by the map tasks and sent to the reduce tasks for generating the final output.

What Is The Difference Between The MapReduce Engine And An HDFS Cluster?

HDFS cluster is the name given to the whole configuration of master and slave nodes where data is stored. The MapReduce engine is the programming module used to retrieve and analyze that data.

Is Map Like A Pointer?

No, Map is not like a pointer.

Do We Require Two Servers For The Namenode And The Datanodes?

Yes, we need different servers for the NameNode and the DataNodes. The NameNode requires a high-specification system because it stores the location details of all the files held on the different DataNodes, whereas the DataNodes can run on lower-specification systems.

Why Is The Number Of Splits Equal To The Number Of Maps?

The number of maps is equal to the number of input splits because we want the key and value pairs of all the input splits.

Is A Job Split Into Maps?

No, a job is not split into maps. Splits are created for the input file, which is stored on the DataNodes in blocks. One map is needed for each split.
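
As a rough illustration of "one map per split", the sketch below estimates the number of map tasks for a single file. It is a simplified model: the function name is hypothetical, the 64 MB split size is only an example value, and real input formats apply extra rules when computing splits.

```python
def count_map_tasks(file_size_bytes: int, split_size_bytes: int) -> int:
    """Estimate map tasks for one file: one task per (possibly partial) split."""
    if file_size_bytes < 0 or split_size_bytes <= 0:
        raise ValueError("file size must be >= 0 and split size must be > 0")
    # Integer ceiling division: a trailing partial split still needs its own map.
    return (file_size_bytes + split_size_bytes - 1) // split_size_bytes


MB = 1024 * 1024
maps_needed = count_map_tasks(200 * MB, 64 * MB)
print(maps_needed)  # three full 64 MB splits plus one 8 MB remainder -> 4
```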

What Are The Two Types Of 'Writes' In HDFS?

There are two types of writes in HDFS: posted and non-posted. A posted write is one where we write the data and forget about it, without waiting for an acknowledgement, much like the traditional Indian postal service. In a non-posted write, we wait for the acknowledgement, much like today's courier services. Naturally, a non-posted write is much more expensive than a posted write, though both kinds of write are asynchronous.

Why Is 'Reading' Done In Parallel But 'Writing' Is Not In HDFS?

Reading is done in parallel because it lets us access the data faster. Writes, however, are not performed in parallel, because parallel writes could result in data inconsistency. For example, if two nodes try to write to the same file in parallel, the first node does not know what the second node has written and vice versa, so it becomes unclear which data should be stored and accessed.

Can Hadoop Be Compared To A NoSQL Database Like Cassandra?

Though NoSQL is the closest technology that can be compared to Hadoop, it has its own pros and cons. There is no DFS in NoSQL. Hadoop is not a database; it is a file system (HDFS) and a distributed programming framework (MapReduce).

How Many Daemon Processes Run On A Hadoop System?

Hadoop comprises five separate daemons, each of which runs in its own JVM.

The following 3 daemons run on master nodes:

NameNode : This daemon stores and maintains the metadata for HDFS.

Secondary NameNode : Performs housekeeping functions for the NameNode.

JobTracker : Manages MapReduce jobs and distributes individual tasks to the machines running the TaskTracker.

The following 2 daemons run on each slave node:

DataNode : Stores actual HDFS data blocks.

TaskTracker : Responsible for instantiating and monitoring individual Map and Reduce tasks.

What Is The Configuration Of A Typical Slave Node On A Hadoop Cluster? How Many JVMs Run On A Slave Node?

A single instance of the TaskTracker runs on each slave node, as a separate JVM process.
A single instance of the DataNode daemon runs on each slave node, as a separate JVM process.
One or more task instances run on each slave node, each as a separate JVM process. The number of task instances can be controlled by configuration; typically, a high-end machine is configured to run more task instances.
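
Putting those three points together gives a simple upper bound on the JVM count for a slave node. The sketch assumes every configured task slot is busy and each task gets its own JVM; the function name and the slot numbers are hypothetical examples.

```python
def max_jvms_on_slave(map_slots: int, reduce_slots: int) -> int:
    """Upper bound on JVMs: two daemons plus one JVM per running task."""
    if map_slots < 0 or reduce_slots < 0:
        raise ValueError("slot counts cannot be negative")
    daemons = 2  # one TaskTracker JVM + one DataNode JVM
    return daemons + map_slots + reduce_slots


# Example: a node configured (hypothetically) for 4 map and 2 reduce tasks.
jvm_count = max_jvms_on_slave(map_slots=4, reduce_slots=2)
print(jvm_count)
```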

What Is The Difference Between HDFS And NAS?

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.

The following are the differences between HDFS and NAS:

In HDFS, data blocks are distributed across the local drives of all machines in a cluster, whereas in NAS, data is stored on dedicated hardware.
HDFS is designed to work with the MapReduce system, since computation is moved to the data. NAS is not suitable for MapReduce, since data is stored separately from the computation.
HDFS runs on a cluster of machines and provides redundancy using a replication protocol, whereas NAS is provided by a single machine and therefore does not provide data redundancy.

How Does The NameNode Handle DataNode Failures?

The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, the DataNode is marked as dead. Since its blocks are now under-replicated, the system begins replicating the blocks that were stored on the dead DataNode. The NameNode orchestrates the replication of data blocks from one DataNode to another. The replication data transfer happens directly between DataNodes, and the data never passes through the NameNode.
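
The bookkeeping described above can be sketched as a toy model in Python. Everything here is an assumption made for illustration: the class NameNodeSketch, its method names, the 30-second timeout, and the rule for picking a source and target are not Hadoop's actual implementation. Note that check only returns copy orders; the sketch never handles block data itself, mirroring the point that transfers go directly between DataNodes.

```python
from typing import Dict, List, Set, Tuple


class NameNodeSketch:
    """Toy model: track heartbeats and block locations, plan re-replication."""

    def __init__(self, timeout_s: float, replication: int):
        self.timeout_s = timeout_s          # silence longer than this means "dead"
        self.replication = replication      # desired copies of each block
        self.last_heartbeat: Dict[str, float] = {}
        self.block_locations: Dict[str, Set[str]] = {}

    def heartbeat(self, node: str, now: float) -> None:
        self.last_heartbeat[node] = now

    def block_report(self, node: str, blocks: List[str]) -> None:
        # Replace whatever was previously recorded for this node.
        for holders in self.block_locations.values():
            holders.discard(node)
        for block in blocks:
            self.block_locations.setdefault(block, set()).add(node)

    def check(self, now: float) -> List[Tuple[str, str, str]]:
        """Mark silent nodes dead; return (block, source, target) copy orders."""
        dead = {n for n, t in self.last_heartbeat.items() if now - t > self.timeout_s}
        for node in dead:
            del self.last_heartbeat[node]
            for holders in self.block_locations.values():
                holders.discard(node)

        live = sorted(self.last_heartbeat)
        orders = []
        for block, holders in sorted(self.block_locations.items()):
            candidates = [n for n in live if n not in holders]
            while holders and len(holders) < self.replication and candidates:
                source = sorted(holders)[0]
                target = candidates.pop(0)
                orders.append((block, source, target))
                holders.add(target)  # optimistic: assume the copy will succeed
        return orders


nn = NameNodeSketch(timeout_s=30, replication=2)
for node in ("dn1", "dn2", "dn3"):
    nn.heartbeat(node, now=0)
nn.block_report("dn1", ["blk_1"])
nn.block_report("dn2", ["blk_1"])

# dn2 and dn3 keep sending heartbeats; dn1 goes silent.
nn.heartbeat("dn2", now=40)
nn.heartbeat("dn3", now=40)
orders = nn.check(now=45)
print(orders)
```

Here dn1 is declared dead, blk_1 drops to one replica, and the sketch orders a copy from dn2 to dn3 to restore the replication factor.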
