The popular methods for debugging Hadoop code are:
By using the web interface provided by the Hadoop framework
By using Counters
Explain what are storage and compute nodes?
The storage node is the machine or computer where your file system resides to store the data being processed.
The compute node is the computer or machine where your actual business logic will be executed.
Mention what is the use of Context Object?
The Context Object enables the mapper to interact with the rest of the Hadoop system. It includes configuration data for the job, as well as interfaces that allow it to emit output.
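As a minimal sketch using the newer org.apache.hadoop.mapreduce API, a mapper might read a job setting and emit output entirely through the Context object; the class name LineLengthMapper and the my.marker property below are purely illustrative:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: reads job configuration and writes output via Context.
public class LineLengthMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Pull job configuration through the context ("my.marker" is a made-up key).
        String marker = context.getConfiguration().get("my.marker", "line");
        // Emit a (key, value) pair through the same context.
        context.write(new Text(marker), new IntWritable(line.getLength()));
    }
}
```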
When the NameNode is down, what happens to the JobTracker?
The NameNode is the single point of failure in HDFS, so when the NameNode is down the cluster goes offline and the JobTracker cannot run any jobs, because HDFS is unavailable.
Explain how indexing in HDFS is done?
Hadoop has a unique way of indexing. Once the data is stored according to the block size, HDFS keeps storing the last part of the data, which indicates where the next part of the data will be.
Explain is it possible to search for files using wildcards?
Yes, it is possible to search for files using wildcards.
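For instance, the FileSystem API exposes globStatus() for wildcard matching (the same glob patterns also work with hadoop fs -ls). A rough sketch, with a placeholder path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: list all .txt files under an example directory using a glob pattern.
public class GlobSearch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // globStatus expands wildcards such as * and ? against the file system.
        FileStatus[] matches = fs.globStatus(new Path("/user/data/logs/*.txt"));
        for (FileStatus status : matches) {
            System.out.println(status.getPath());
        }
    }
}
```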
List out Hadoop’s three configuration files?
The three configuration files are
core-site.xml
mapred-site.xml
hdfs-site.xml
Explain how you can check whether the NameNode is working, besides using the jps command?
Besides the jps command, you can also check whether the NameNode is running with
/etc/init.d/hadoop-0.20-namenode status.
Explain what is “map” and what is “reducer” in Hadoop?
In Hadoop, a map is a phase of HDFS data processing. A map reads data from an input location and outputs a key-value pair according to the input type.
In Hadoop, a reducer collects the output generated by the mapper, processes it, and creates a final output of its own.
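A minimal word-count sketch illustrating both roles (class names are illustrative; assumes the org.apache.hadoop.mapreduce API):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // The map phase: read a line, emit (word, 1) pairs.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // The reduce phase: sum the counts collected for each word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```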
Which file controls reporting in Hadoop?
In Hadoop, the hadoop-metrics.properties file controls reporting.
List the network requirements for using Hadoop?
The network requirements for using Hadoop are:
Password-less SSH connection
Secure Shell (SSH) for launching server processes
Mention what is the next step after Mapper or MapTask?
The next step after the Mapper or MapTask is that the output of the Mapper is sorted, and partitions are created for that output.
Mention what is the name of the default partitioner in Hadoop?
In Hadoop, the default partitioner is a “Hash” Partitioner.
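Its behaviour can be sketched as follows; the class below simply mirrors the hash-and-modulus logic of Hadoop's HashPartitioner:

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the logic a hash partitioner applies to pick a reducer.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative,
        // then take the modulus over the number of reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```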
Explain what is the purpose of RecordReader in Hadoop?
In Hadoop, the RecordReader loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper.
Mention what the JobConf class does?
The JobConf class separates different jobs running on the same cluster. It handles the job-level settings, such as declaring a job in a real environment.
Mention what is the Hadoop MapReduce APIs contract for a key and value class?
For a key and value class, the Hadoop MapReduce API contract is:
The value class must implement the org.apache.hadoop.io.Writable interface
The key class must implement the org.apache.hadoop.io.WritableComparable interface
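As a hedged sketch, a custom composite key honouring this contract might look like the following (the YearMonthKey name and its fields are illustrative):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Illustrative key class: serializable (Writable) and sortable (Comparable).
public class YearMonthKey implements WritableComparable<YearMonthKey> {
    private int year;
    private int month;

    // A no-argument constructor is required so Hadoop can instantiate the key.
    public YearMonthKey() {}

    public YearMonthKey(int year, int month) {
        this.year = year;
        this.month = month;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(month);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        month = in.readInt();
    }

    @Override
    public int compareTo(YearMonthKey other) {
        int byYear = Integer.compare(year, other.year);
        return byYear != 0 ? byYear : Integer.compare(month, other.month);
    }

    @Override
    public int hashCode() {
        return year * 31 + month;   // keeps hash partitioning consistent with equals
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof YearMonthKey)) return false;
        YearMonthKey k = (YearMonthKey) o;
        return year == k.year && month == k.month;
    }
}
```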
Mention what are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are
Pseudo distributed mode
Standalone (local) mode
Fully distributed mode
Mention what does the text input format do?
The text input format creates a line object for each line of the input. The key is the byte offset of the line within the file, while the value is the content of the whole line. The mapper receives the value as a ‘Text’ parameter and the key as a ‘LongWritable’ parameter.
What happens when a DataNode fails?
When a DataNode fails:
The JobTracker and NameNode detect the failure
All tasks on the failed node are re-scheduled
The NameNode replicates the user's data to another node
Explain what is Speculative Execution?
In Hadoop, during Speculative Execution a certain number of duplicate tasks are launched. Multiple copies of the same map or reduce task can be executed on different slave nodes using Speculative Execution. In simple words, if a particular node is taking a long time to complete a task, Hadoop will create a duplicate task on another node. The copy that finishes first is retained, and the copies that do not finish first are killed.
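Speculative execution can be toggled per job; a rough sketch using the Hadoop 2 property names mapreduce.map.speculative and mapreduce.reduce.speculative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: enable speculative attempts for map tasks, disable them for reduces.
public class SpeculativeConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Allow duplicate (speculative) attempts for slow map tasks...
        conf.setBoolean("mapreduce.map.speculative", true);
        // ...but disable them for reduce tasks.
        conf.setBoolean("mapreduce.reduce.speculative", false);
        Job job = Job.getInstance(conf, "speculative-demo");
        // ...set mapper, reducer, input and output as usual, then submit.
    }
}
```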
Explain what are the basic parameters of a Mapper?
The basic parameters of a Mapper are
LongWritable and Text
Text and IntWritable
Explain what is the function of the MapReduce partitioner?
The function of the MapReduce partitioner is to make sure that all the values of a single key go to the same reducer, which eventually helps ensure an even distribution of the map output over the reducers.
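As a hedged sketch, a custom partitioner only has to guarantee that identical keys land in the same partition; here records are routed by the key's first character (the class name and routing rule are illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: same key always yields the same partition number.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;                       // single-reducer or map-only job
        }
        String k = key.toString();
        char first = k.isEmpty() ? ' ' : k.charAt(0);
        return (Character.toLowerCase(first) & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

It would be registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class).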
Explain what is difference between an Input Split and HDFS Block?
The logical division of data is known as an Input Split, while the physical division of data is known as an HDFS Block.
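A sketch of the two different knobs, assuming standard Hadoop 2 settings: dfs.blocksize governs the physical HDFS block, while the FileInputFormat split size caps the logical InputSplit handed to each map task:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitVsBlockExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Physical storage unit: 128 MB HDFS blocks (applies to newly written files).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        Job job = Job.getInstance(conf, "split-vs-block");
        // Logical processing unit: cap each InputSplit at 64 MB.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    }
}
```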
How does Hadoop MapReduce work?
In MapReduce, during the map phase the input data is divided into splits that are analyzed by map tasks running in parallel across the Hadoop framework. In a word count job, for example, the map phase counts the words in each document, while the reduce phase aggregates the counts across the entire collection of documents.
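A hedged driver sketch wiring such a job together; WordCount.TokenMapper and WordCount.SumReducer refer to the illustrative classes sketched earlier, and the input and output paths are placeholders taken from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input, divided into splits
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // reducer output directory

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```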
Explain what is shuffling in MapReduce?
The process by which the system performs the sort and transfers the map outputs to the reducer as inputs is known as the shuffle.
Explain what is Distributed Cache in the MapReduce framework?
Distributed Cache is an important feature provided by the MapReduce framework. When you want to share some files across all nodes in a Hadoop cluster, the DistributedCache is used. The files could be executable JAR files or simple properties files.
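A rough sketch of the Hadoop 2 usage: the driver registers a file with job.addCacheFile(), and each task can read its local copy during setup() (the path and class names below are placeholders):

```java
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {

    // In the driver, before submitting the job: register a file to distribute.
    // The "#lookup" fragment asks Hadoop to symlink the local copy as "lookup".
    static void registerCacheFile(Job job) throws Exception {
        job.addCacheFile(new URI("/user/data/lookup.properties#lookup"));
    }

    public static class CachedLookupMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            // Every task sees the registered cache files listed here and can
            // open the local copy from its working directory.
            URI[] cached = context.getCacheFiles();
            // ... load lookup data from the local file, then use it in map() ...
        }
    }
}
```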
Explain what is NameNode in Hadoop?
NameNode in Hadoop is the node where Hadoop stores all the file location information in HDFS (Hadoop Distributed File System). In other words, the NameNode is the centrepiece of an HDFS file system. It keeps the record of all the files in the file system and tracks the file data across the cluster or multiple machines.
Explain what happens if no custom partitioner is defined in Hadoop?
If no custom partitioner is defined in Hadoop, then the default partitioner computes a hash value for the key and assigns the partition based on the result.
Explain what happens when Hadoop spawned 50 tasks for a job and one of the tasks failed?
Hadoop will restart the failed task on some other TaskTracker; the whole job is killed only if the same task fails more than the defined limit of attempts (four by default).
Mention what is the best way to copy files between HDFS clusters?
The best way to copy files between HDFS clusters is by using multiple nodes and the distcp command, so the workload is shared.