December 22, 2018

Srikaanth

Hitachi Most Frequently Asked Latest Hadoop Interview Questions Answers

Which Are The Methods In The Mapper Interface?

The Mapper contains the run() method, which calls its setup() method once, then calls the map() method for each input record, and finally calls its cleanup() method. All of these methods can be overridden in your code.
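As a rough illustration, a Mapper subclass can override all three lifecycle methods. This is only a sketch; the class name, field names, and word-length logic below are made up for the example and are not part of any standard API.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Illustrative mapper that emits (word, length) pairs.
    public class WordLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final Text word = new Text();
        private final IntWritable length = new IntWritable();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Called once per task, before any call to map() -- read configuration here if needed.
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Called once for every input record in the task's InputSplit.
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    length.set(token.length());
                    context.write(word, length);
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Called once per task, after the last call to map().
        }
    }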

What Happens If You Don't Override The Mapper Methods And Keep Them As It Is?

If you do not override any of these methods (leaving even map() as-is), the Mapper acts as the identity function, emitting each input record unchanged as a separate output.

What Is The Use Of Context Object?

The Context object allows the Mapper to interact with the rest of the Hadoop system. It includes configuration data for the job, as well as interfaces which allow it to emit output.

How Can You Add The Arbitrary Key-value Pairs In Your Mapper?

You can set arbitrary (key, value) pairs of configuration data in your Job, e.g. with
Job.getConfiguration().set("myKey", "myVal"), and then retrieve this data in your mapper with
Context.getConfiguration().get("myKey"). This kind of functionality is typically done in the Mapper's setup() method.

How Does The Mapper's run() Method Work?

The Mapper.run() method calls setup() once, then calls map(KeyInType, ValInType, Context) for each key/value pair in the InputSplit for that task, and finally calls cleanup().
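In outline, the default implementation behaves roughly like the sketch below (newer Hadoop versions wrap the loop in a try/finally so cleanup() always runs):

    public void run(Context context) throws IOException, InterruptedException {
        setup(context);                       // once per task
        while (context.nextKeyValue()) {      // iterate over the task's InputSplit
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        cleanup(context);                     // once per task
    }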

Which Object Can Be Used To Get The Progress Of A Particular Job?

Context

What Is The Next Step After The Mapper Or MapTask?

The output of the Mapper is sorted, and partitions are created for the output. The number of partitions depends on the number of reducers.

How Can We Control Which Keys Go To A Specific Reducer?

Users can control which keys (and hence which records) go to which Reducer by implementing a custom Partitioner.
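For example, a custom Partitioner might look like the sketch below. The class name and the hash-based routing are illustrative (the default HashPartitioner does something similar); the point is only that getPartition() decides which reducer receives a given key.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class MyPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Decide which reducer (partition) this key is sent to.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // In the driver: job.setPartitionerClass(MyPartitioner.class);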

What Is The Use Of Combiner?

It is an optional component or class, and can be specified via Job.setCombinerClass(ClassName), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.

How Many Maps Are There In A Particular Job?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

Generally it is around 10-100 maps per node. Task setup takes a while, so it is best if the maps take at least a minute to execute.

Suppose you expect 10TB of input data and have a block size of 128MB: since 10TB / 128MB = 81,920, you'll end up with roughly 82,000 maps. To control the number of maps you can use the mapreduce.job.maps parameter (which only provides a hint to the framework). Ultimately, the number of tasks is controlled by the number of splits returned by the InputFormat.getSplits() method (which you can override).

What Is The Reducer Used For?

Reducer reduces a set of intermediate values which share a key to a (usually smaller) set of values.

The number of reduce tasks for the job is set by the user via Job.setNumReduceTasks(int).
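A minimal driver sketch tying this together with the combiner setting from the previous question. TokenizerMapper and IntSumReducer are assumed user-defined classes (a possible IntSumReducer is sketched further below under the combiner question); the reduce-task count of 4 is arbitrary.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(TokenizerMapper.class);    // assumed user-defined mapper
            job.setCombinerClass(IntSumReducer.class);    // optional local aggregation
            job.setReducerClass(IntSumReducer.class);     // assumed user-defined reducer
            job.setNumReduceTasks(4);                     // number of reduce tasks (and partitions)
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }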

What Is The JobTracker And What Does It Do In A Hadoop Cluster?

JobTracker is a daemon service which submits and tracks the MapReduce tasks in a Hadoop cluster. It runs in its own JVM process, usually on a separate machine, and each slave node is configured with the JobTracker node's location. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted.

JobTracker in Hadoop performs the following actions:

Client applications submit jobs to the Job tracker.
The JobTracker talks to the NameNode to determine the location of the data
The JobTracker locates TaskTracker nodes with available slots at or near the data
The JobTracker submits the work to the chosen TaskTracker nodes.
The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
When the work is completed, the JobTracker updates its status.
Client applications can poll the JobTracker for information.

What Are Combiners? When Should I Use A Combiner In My Mapreduce Job?

Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally on individual mapper nodes, which helps reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed; Hadoop may or may not execute it, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
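For instance, a simple sum reducer can double as a combiner, because addition is commutative and associative. This is only a sketch; the class name matches the assumed IntSumReducer in the driver sketch earlier.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();   // summing partial (combiner) or full (reducer) counts gives the same result
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Registered as both combiner and reducer in the driver:
    // job.setCombinerClass(IntSumReducer.class);
    // job.setReducerClass(IntSumReducer.class);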

What Are The Writable & WritableComparable Interfaces?

org.apache.hadoop.io.Writable is a Java interface. Any key or value type in the Hadoop Map-Reduce framework implements this interface. Implementations typically implement a static read(DataInput) method which constructs a new instance, calls readFields(DataInput) and returns the instance.
org.apache.hadoop.io.WritableComparable is a Java interface. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface. WritableComparable objects can be compared to each other using Comparators.
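A sketch of a custom key type implementing WritableComparable; the class and field names are made up for illustration.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class YearTemperaturePair implements WritableComparable<YearTemperaturePair> {
        private int year;
        private int temperature;

        @Override
        public void write(DataOutput out) throws IOException {     // serialize the fields
            out.writeInt(year);
            out.writeInt(temperature);
        }

        @Override
        public void readFields(DataInput in) throws IOException {  // deserialize in the same order
            year = in.readInt();
            temperature = in.readInt();
        }

        @Override
        public int compareTo(YearTemperaturePair other) {          // ordering used by the sort phase
            int cmp = Integer.compare(year, other.year);
            return (cmp != 0) ? cmp : Integer.compare(temperature, other.temperature);
        }
    }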

How A Task Is Scheduled By A Jobtracker?

The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.

How Many Instances Of TaskTracker Run On A Hadoop Cluster?

There is one TaskTracker daemon process for each slave node in the Hadoop cluster.

What Are The Two Main Parts Of The Hadoop Framework?

Hadoop consists of two main parts.

Hadoop Distributed File System (HDFS), a distributed file system with high throughput, and
Hadoop MapReduce, a software framework for processing large data sets.

Can Reducers Talk To Each Other?

No, Reducers run in isolation.

Where Is The Mapper's Intermediate Data Stored?

The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.

What Is The Use Of Combiners In The Hadoop Framework?

Combiners are used to increase the efficiency of a MapReduce program. They are used to aggregate intermediate map output locally on individual mapper outputs. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers.

You can use your reducer code as a combiner if the operation performed is commutative and associative.

The execution of the combiner is not guaranteed; Hadoop may or may not execute a combiner. Also, if required, it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.

What Is The Hadoop Mapreduce Api Contract For A Key And Value Class?

The Key must implement the org.apache.hadoop.io.WritableComparable interface.
The value must implement the org.apache.hadoop.io.Writable interface.

What Are IdentityMapper And IdentityReducer In MapReduce?

org.apache.hadoop.mapred.lib.IdentityMapper: Implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the Mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default.
org.apache.hadoop.mapred.lib.IdentityReducer: Performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the Reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default.
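A short sketch using the old (org.apache.hadoop.mapred) API, showing what setting these defaults explicitly would look like; MyDriver is a placeholder class name for the illustration.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    // Setting the identity classes explicitly -- the same classes the old API
    // falls back to when no mapper/reducer is configured.
    JobConf conf = new JobConf(MyDriver.class);   // MyDriver is a placeholder
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);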

Explain The Use Of TaskTracker In The Hadoop Cluster?

A TaskTracker is a slave node in the cluster which accepts tasks from the JobTracker, such as Map, Reduce, or Shuffle operations. The TaskTracker also runs in its own JVM process.

Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. The TaskTracker starts separate JVM processes to do the actual work (called Task Instances); this is to ensure that process failure does not take down the TaskTracker.

The TaskTracker monitors these Task Instances, capturing the output and exit codes. When the Task Instances finish, successfully or not, the TaskTracker notifies the JobTracker.

The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that it is still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.

What Do You Mean By Task Instance?

Task Instances are the actual MapReduce tasks which run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that process failure does not take down the entire TaskTracker. Each Task Instance runs in its own JVM process. There can be multiple Task Instance processes running on a slave node, based on the number of slots configured on the TaskTracker. By default, a new Task Instance JVM process is spawned for each task.

How Many Daemon Processes Run On A Hadoop Cluster?

Hadoop comprises five separate daemons. Each of these daemons runs in its own JVM.

The following three daemons run on master nodes:

NameNode : This daemon stores and maintains the metadata for HDFS.

Secondary NameNode : Performs housekeeping functions for the NameNode.

JobTracker : Manages MapReduce jobs and distributes individual tasks to machines running the TaskTracker.

The following two daemons run on each slave node:

DataNode : Stores actual HDFS data blocks.

TaskTracker : Responsible for instantiating and monitoring individual Map and Reduce tasks.

What Is The Maximum Number Of JVMs That Can Run On A Slave Node?

One or multiple Task Instances can run on each slave node. Each Task Instance runs as a separate JVM process. The number of Task Instances can be controlled by configuration. Typically a high-end machine is configured to run more Task Instances.

What Is NAS?

NAS (Network-Attached Storage) is a kind of file system where data resides on one centralized machine and all the cluster members read and write data from that shared storage, which is not as efficient as HDFS.
