Your cluster's HDFS block size is 64MB. You have a directory containing 100 plain text files, each of which is 100MB in size. The InputFormat for your job is TextInputFormat. Determine how many Mappers will run?
A. 64
B. 100
C. 200
D. 640
Answer: C (each 100MB file spans two 64MB blocks, so it produces two input splits; 100 files x 2 splits = 200 map tasks)
Can you use MapReduce to perform a relational join on two large tables sharing a
key?
Assume that the two tables are formatted as comma-separated files in HDFS.
A. Yes.
B. Yes, but only if one of the tables fits into memory
C. Yes, so long as both tables fit into memory.
D. No, MapReduce cannot perform relational operations.
E. No, but it can be done with either Pig or Hive.
Answer: A
Which process describes the lifecycle of a Mapper?
A. The JobTracker calls the TaskTracker's configure() method, then its map() method and finally its close() method.
B. The TaskTracker spawns a new Mapper to process all records in a single input split.
C. The TaskTracker spawns a new Mapper to process each key-value pair.
D. The JobTracker spawns a new Mapper to process all records in a single file.
Answer: B
In a MapReduce job with 500 map tasks, how many map task attempts will there
be?
A. It depends on the number of reduces in the job.
B. Between 500 and 1000.
C. At most 500.
D. At least 500.
E. Exactly 500.
Answer: D
MapReduce v2 (MRv2 /YARN) splits which major functions of the JobTracker into
separate
daemons? Select two.
A. Health state checks (heartbeats)
B. Resource management
C. Job scheduling/monitoring
D. Job coordination between the ResourceManager and NodeManager
E. Launching tasks
F. Managing file system metadata
G. MapReduce metric reporting
H. Managing tasks
Answer: B, C
When is the earliest point at which the reduce method of a given Reducer can be
called?
A. As soon as at least one mapper has finished processing its input split.

B. As soon as a mapper has emitted at least one record.


C. Not until all mappers have finished processing all records.
D. It depends on the InputFormat used for the job.
Answer: C
In a large MapReduce job with m mappers and n reducers, how many distinct copy
operations will there be in the sort/shuffle phase?
A. m * n (i.e., m multiplied by n)
B. n
C. m
D. m+n (i.e., m plus n)
E. m^n (i.e., m to the power of n)
Answer: A
For each intermediate key, each reducer task can emit:
A. As many final key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
B. As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs.
C. As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
D. One final key-value pair per value associated with the key; no restrictions on the type.
E. One final key-value pair per key; no restrictions on the type.
Answer: E
You need to move a file titled 'weblogs' into HDFS. When you try to copy the file, you can't.
You know you have ample space on your DataNodes. Which action should you take
to relieve this situation and store more files in HDFS?
A. Increase the block size on all current files in HDFS.
B. Increase the block size on your remaining files.
C. Decrease the block size on your remaining files.
D. Increase the amount of memory for the NameNode.
E. Increase the number of disks (or size) for the NameNode.
F. Decrease the block size on all current files in HDFS.
Answer: C
Identify which best defines a SequenceFile?
A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous Writable objects
B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous Writable objects
C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.
Answer: D

Which describes how a client reads a file from HDFS?
A. The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly off the DataNode(s).
B. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode.
C. The client contacts the NameNode for the block location(s). The NameNode then queries the DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode redirects the client to the DataNode that holds the requested data block(s). The client then reads the data directly off the DataNode.
D. The client contacts the NameNode for the block location(s). The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode, and then from the NameNode to the client.
Answer: A

You are developing a combiner that takes as input Text keys, IntWritable values, and emits Text keys, IntWritable values. Which interface should your class implement?
A. Combiner <Text, IntWritable, Text, IntWritable>
B. Mapper <Text, IntWritable, Text, IntWritable>
C. Reducer <Text, Text, IntWritable, IntWritable>
D. Reducer <Text, IntWritable, Text, IntWritable>
E. Combiner <Text, Text, IntWritable, IntWritable>
Answer:
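For illustration only (not part of the original question set): a minimal sketch of such a combiner using the older org.apache.hadoop.mapred API that the question refers to. The class name and the summing logic are assumptions for the example; the point is that a combiner implements the Reducer interface with the map output types as both its input and output types, and is registered with JobConf.setCombinerClass().

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Combiner: implements Reducer<Text, IntWritable, Text, IntWritable>
public class SumCombiner extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();   // local, partial aggregation of the map output
    }
    output.collect(key, new IntWritable(sum));
  }
}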

Identify the utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?
A. Oozie
B. Sqoop
C. Flume
D. Hadoop Streaming
E. mapred
Answer: D
Explanation:
Hadoop Streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.

How are keys and values presented and passed to the reducers during a standard sort and shuffle phase of MapReduce?
A. Keys are presented to reducer in sorted order; values for a given key are not sorted.
B. Keys are presented to reducer in sorted order; values for a given key are sorted in ascending order.
C. Keys are presented to a reducer in random order; values for a given key are not sorted.
D. Keys are presented to a reducer in random order; values for a given key are sorted in ascending order.
Answer: A
Explanation:
Reducer has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper using HTTP across the network.
2. Sort
The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.
Secondary Sort
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce.
3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.
The output of the reduce task is typically written to a RecordWriter
via TaskInputOutputContext.write(Object, Object). The output of the Reducer is not resorted.

Assuming default settings, which best describes the order of data provided to a reducer's reduce method:
A. The keys given to a reducer aren't in a predictable order, but the values associated with those keys always are.
B. Both the keys and values passed to a reducer always appear in sorted order.
C. Neither keys nor values are in any predictable order.
D. The keys given to a reducer are in sorted order but the values associated with each key are in no predictable order.
Answer: D
Explanation:
Reducer has 3 primary phases:
1. Shuffle
The Reducer copies the sorted output from each Mapper using HTTP across the network.
2. Sort
The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.
Secondary Sort
To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce.
3. Reduce
In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs. The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object). The output of the Reducer is not re-sorted.

You wrote a map function that throws a runtime exception when it encounters a control character in input data. The input supplied to your mapper contains twelve such characters in total, spread across five file splits. The first four file splits each have two control characters and the last split has four control characters.
Identify the number of failed task attempts you can expect when you run the job with mapred.max.map.attempts set to 4:
A. You will have forty-eight failed task attempts
B. You will have seventeen failed task attempts
C. You will have five failed task attempts
D. You will have twelve failed task attempts
E. You will have twenty failed task attempts
Answer: E
Explanation:
There will be four failed task attempts for each of the five file splits.

You want to populate an associative array in order to perform a map-side join. You've decided to put this information in a text file, place that file into the DistributedCache and read it in your Mapper before any records are processed. Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array?
A. combine
B. map
C. init
D. configure
Answer: D

Explanation:
See 3) below. Here is an illustrative example on how to use the DistributedCache:
// Setting up the cache for the application

1. Copy the requisite files to the FileSystem:

$ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
$ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
$ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
$ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
$ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz

2. Setup the application's JobConf:

JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);

3. Use the cached files in the Mapper or Reducer:

public static class MapClass extends MapReduceBase implements Mapper<K, V, K, V> {

  private Path[] localArchives;
  private Path[] localFiles;

  public void configure(JobConf job) {
    // Get the cached archives/files
    localArchives = DistributedCache.getLocalCacheArchives(job);
    localFiles = DistributedCache.getLocalCacheFiles(job);
  }

  public void map(K key, V value, OutputCollector<K, V> output, Reporter reporter)
      throws IOException {
    // Use data from the cached archives/files here
    // ...
    output.collect(k, v);
  }
}
You've written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which interface is most likely to reduce the amount of intermediate data transferred across the network?
A. Partitioner
B. OutputFormat
C. WritableComparable
D. Writable
E. InputFormat
F. Combiner
Answer: F

Explanation:
Combiners are used to increase the efficiency of a MapReduce program. They are used to
aggregate intermediate map output locally on individual mapper outputs. Combiners can

help you reduce the amount of data that needs to be transferred across to the reducers. You
can use your reducer code as a combiner if the operation performed is commutative and
associative.
Can you use MapReduce to perform a relational join on two large tables sharing
a key? Assume that the two tables are formatted as comma-separated files in
HDFS.
A. Yes.
B. Yes, but only if one of the tables fits into memory
C. Yes, so long as both tables fit into memory.
D. No, MapReduce cannot perform relational operations.
E. No, but it can be done with either Pig or Hive.
Answer: A
Explanation:
Join Algorithms in MapReduce:
- Reduce-side join
- Map-side join
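As a rough, hedged sketch of the reduce-side join listed above (not taken from the exam material): the mapper tags each CSV record with the file it came from, and the reducer groups records by the join key and emits their combinations. The file name "orders.csv", the choice of column 0 as the join key, and the class names are assumptions made for this example.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: emit (joinKey, taggedRecord), where the tag records which table the line came from.
public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String table = ((FileSplit) context.getInputSplit()).getPath().getName(); // e.g. "orders.csv"
    String[] fields = line.toString().split(",");
    context.write(new Text(fields[0]),                        // column 0 assumed to be the join key
                  new Text(table + "\t" + line.toString()));  // tag the record with its source
  }
}

// Reducer: for one join key, separate the two sources and emit their cross product.
class JoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> left = new ArrayList<String>();
    List<String> right = new ArrayList<String>();
    for (Text v : values) {
      String[] parts = v.toString().split("\t", 2);
      if (parts[0].startsWith("orders")) left.add(parts[1]); else right.add(parts[1]);
    }
    for (String l : left) {
      for (String r : right) {
        context.write(key, new Text(l + "," + r));            // one joined row per left/right pair
      }
    }
  }
}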

You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper's map method?
A. Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk.
B. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
C. Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
D. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer.
E. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.

You want to understand more about how users browse your public website, such as which pages they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How will you gather this data for your analysis?
1. Ingest the server web logs into HDFS using Flume.
2. Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for reducers.
3. Import all users' clicks from your OLTP databases into Hadoop, using Sqoop.
4. Channel these clickstreams into Hadoop using Hadoop Streaming.
5. Sample the weblogs from the web servers, copying them into Hadoop using curl.
MapReduce v2 (MRv2/YARN) is designed to address which two issues?
A. Single point of failure in the NameNode.
B. Resource pressure on the JobTracker.
C. HDFS latency.
D. Ability to run frameworks other than MapReduce, such as MPI.
E. Reduce complexity of the MapReduce APIs.
F. Standardize on a single MapReduce API.

You need to run the same job many times with minor variations. Rather than hardcoding all job configuration options in your driver code, you've decided to have your Driver subclass org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface. Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop?
A. hadoop 'mapred.job.name=Example' MyDriver input output
B. hadoop MyDriver mapred.job.name=Example input output
C. hadoop MyDriver -D mapred.job.name=Example input output
D. hadoop setproperty mapred.job.name=Example MyDriver input output
E. hadoop setproperty ('mapred.job.name=Example') MyDriver input output
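For context, a minimal hedged sketch of such a driver (the job wiring is an example, not the exam's reference code): when the driver is launched through ToolRunner, GenericOptionsParser strips generic options such as -D mapred.job.name=Example into the Configuration before run() sees the remaining arguments, which is what makes the -D style of invocation work.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains any -D key=value pairs parsed by ToolRunner.
    Configuration conf = getConf();
    Job job = Job.getInstance(conf, conf.get("mapred.job.name", "MyDriver"));
    job.setJarByClass(MyDriver.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // "input"
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // "output"
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
  }
}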

You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values representing product identifiers (Text). Identify what determines the data types used by the Mapper for a given job.
1. The key and value types specified in the JobConf.setMapInputKeyClass and JobConf.setMapInputValuesClass methods
2. The data types specified in the HADOOP_MAP_DATATYPES environment variable
3. The mapper-specification.xml file submitted with the job determines the mapper's input key and value types.
4. The InputFormat used by the job determines the mapper's input key and value types.

Identify the MapReduce v2 (MRv2 / YARN) daemon responsible for launching application containers and monitoring application resource usage?
A. ResourceManager
B. NodeManager
C. ApplicationMaster
D. ApplicationMasterService
E. TaskTracker
F. JobTracker
Which best describes how TextInputFormat processes input files and line breaks?
A. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
B. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
C. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
D. Input file splits may cross line breaks. A line that crosses file splits is ignored.
E. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.

For each input key-value pair, mappers can emit:
A. As many intermediate key-value pairs as designed. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
B. As many intermediate key-value pairs as designed, but they cannot be of the same type as the input key-value pair.
C. One intermediate key-value pair, of a different type.
D. One intermediate key-value pair, but of the same type.
E. As many intermediate key-value pairs as designed, as long as all the keys have the same type and all the values have the same type.
You have the following key-value pairs as output from your Map task:
(the, 1)
(fox, 1)
(faster, 1)
(than, 1)
(the, 1)
(dog, 1)
How many keys will be passed to the Reducer's reduce method?
A. Six
B. Five
C. Four
D. Two
E. One
F. Three

You have user profile records in your OLTP database that you want to join with web logs you have already ingested into the Hadoop file system. How will you obtain these user records?
A. HDFS command
B. Pig LOAD command
C. Sqoop import
D. Hive LOAD DATA command
E. Ingest with Flume agents
F. Ingest with Hadoop Streaming

Which two updates occur when a client application opens a stream to begin a file write on a cluster running MapReduce v1 (MRv1)?
A. Once the write stream closes on the DataNode, the DataNode immediately initiates a block report to the NameNode.
B. The change is written to the NameNode disk.
C. The metadata in the RAM on the NameNode is flushed to disk.
D. The metadata in RAM on the NameNode is flushed to disk.
E. The metadata in RAM on the NameNode is updated.
F. The change is written to the edits file.
Answer:

For a MapReduce job, on a cluster running MapReduce v1 (MRv1), what's the relationship between tasks and task attempts?
A. There are always at least as many task attempts as there are tasks.
B. There are always at most as many task attempts as there are tasks.
C. There are always exactly as many task attempts as there are tasks.
D. The developer sets the number of task attempts on job submission.

What action occurs automatically on a cluster when a DataNode is marked as dead?
A. The NameNode forces re-replication of all the blocks which were stored on the dead DataNode.
B. The next time a client submits a job that requires blocks from the dead DataNode, the JobTracker receives no heartbeats from the DataNode. The JobTracker tells the NameNode that the DataNode is dead, which triggers block re-replication on the cluster.
C. The replication factor of the files which had blocks stored on the dead DataNode is temporarily reduced, until the dead DataNode is recovered and returned to the cluster.
D. The NameNode informs the client which wrote the blocks that they are no longer available; the client then re-writes the blocks to a different DataNode.
How does the NameNode know DataNodes are available on a cluster running MapReduce v1 (MRv1)?
A. DataNodes listed in the dfs.hosts file. The NameNode uses this as the definitive list of available DataNodes.
B. DataNodes heartbeat in to the master on a regular basis.
C. The NameNode broadcasts a heartbeat on the network on a regular basis, and DataNodes respond.
D. The NameNode sends a broadcast across the network when it first starts, and DataNodes respond.
Which three distcp features can you utilize on a Hadoop cluster?
A. Use distcp to copy files only between two clusters or more. You cannot use distcp to copy data between directories inside the same cluster.
B. Use distcp to copy HBase table files.
C. Use distcp to copy physical blocks from the source to the target destination in your cluster.
D. Use distcp to copy data between directories inside the same cluster.
E. Use distcp to run an internal MapReduce job to copy files.
How does HDFS Federation help HDFS scale horizontally?
A. HDFS Federation improves the resiliency of HDFS in the face of network issues by removing the NameNode as a single point of failure.
B. HDFS Federation allows the Standby NameNode to automatically resume the services of an active NameNode.
C. HDFS Federation provides cross-data center (non-local) support for HDFS, allowing a cluster administrator to split the Block Storage outside the local cluster.
D. HDFS Federation reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the filesystem namespace.
Choose which best describes a Hadoop cluster's block size storage parameters once you set the HDFS default block size to 64MB?
A. The block size of files in the cluster can be determined as the block is written.
B. The block size of files in the cluster will all be multiples of 64MB.
C. The block size of files in the cluster will all be at least 64MB.
D. The block size of files in the cluster will all be exactly 64MB.

Which MapReduce daemon instantiates user code, and executes map and reduce tasks on a cluster running MapReduce v1 (MRv1)?
A. NameNode
B. DataNode
C. JobTracker
D. TaskTracker
E. ResourceManager
F. ApplicationMaster
G. NodeManager
Which two actions must you take if you are running a Hadoop cluster with a single NameNode and six DataNodes, and you want to change a configuration parameter so that it affects all six DataNodes?
A. You must restart the NameNode daemon to apply the changes to the cluster.
B. You must restart all six DataNode daemons to apply the changes to the cluster.
C. You don't need to restart any daemon, as they will pick up changes automatically.
D. You must modify the configuration files on each of the six DataNode machines.
E. You must modify the configuration files on only one of the DataNode machines.
F. You must modify the configuration files on the NameNode only. DataNodes read their configuration from the master nodes.
Identify the function performed by the Secondary NameNode daemon on a cluster configured to run with a single NameNode.
A. In this configuration, the Secondary NameNode performs a checkpoint operation on the files used by the NameNode.
B. In this configuration, the Secondary NameNode is a standby NameNode, ready to failover and provide high availability.
C. In this configuration, the Secondary NameNode performs real-time backups of the NameNode.
D. In this configuration, the Secondary NameNode serves as an alternate data channel for clients to reach HDFS, should the NameNode become too busy.
You install Cloudera Manager on a cluster where each host has 1 GB of RAM. All of the services show their status as concerning. However, all jobs submitted complete without an error. Why is Cloudera Manager showing the concerning status for the services?
A. A slave node's disk ran out of space.
B. The slave nodes haven't sent a heartbeat in 60 minutes.
C. The slave nodes are swapping.
D. A DataNode service instance has crashed.
What is the recommended disk configuration for slave nodes in your Hadoop cluster with 6 x 2 TB hard drives?
A. RAID 10
B. JBOD
C. RAID 5
D. RAID 1+0

You configure your cluster with HDFS High Availability (HA) using Quorum-based storage. You do not implement HDFS Federation.
What is the maximum number of NameNode daemons you should run on your cluster in order to avoid a 'split-brain' scenario with your NameNodes?
A. Unlimited. HDFS High Availability (HA) is designed to overcome limitations on the number of NameNodes you can deploy.
B. Two active NameNodes and one Standby NameNode
C. One active NameNode and one Standby NameNode
D. Two active NameNodes and two Standby NameNodes
You configure a Hadoop cluster with both MapReduce frameworks, MapReduce v1 (MRv1) and MapReduce v2 (MRv2/YARN). Which two MapReduce (computational) daemons do you need to configure to run on your master nodes?
A. JobTracker
B. ResourceManager
C. ApplicationMaster
D. JournalNode
E. NodeManager

You observe that the number of spilled records from map tasks far exceeds the number of map output records. Your child heap size is 1 GB and your io.sort.mb value is set to 100MB. How would you tune your io.sort.mb value to achieve the maximum memory-to-disk I/O ratio?
A. Tune the io.sort.mb value until you observe that the number of spilled records equals (or is as close as possible to) the number of map output records.
B. Decrease the io.sort.mb value below 100MB.
C. Increase the io.sort.mb as high as you can, as close to 1GB as possible.
D. For a 1GB child heap size, an io.sort.mb of 128MB will always maximize memory-to-disk I/O.

Your Hadoop cluster has 25 nodes with a total of 100 TB (4 TB per node) of raw disk space allocated to HDFS storage. Assuming Hadoop's default configuration, how much data will you be able to store?
A. Approximately 100 TB
B. Approximately 25 TB
C. Approximately 10 TB
D. Approximately 33 TB
You set up the Hadoop cluster using NameNode Federation. One NameNode manages the /users namespace and one NameNode manages the /data namespace.
What happens when a client tries to write a file to /reports/myreport.txt?
A. The file successfully writes to /users/reports/myreports/myreport.txt.
B. The client throws an exception.
C. The file successfully writes to /reports/myreport.txt. The metadata for the file is managed by the first NameNode to which the client connects.
D. The file write fails silently; no file is written, no error is reported.
Identify two features/issues that MapReduce v2 (MRv2/YARN) is designed to address:
A. Resource pressure on the JobTracker.
B. HDFS latency.
C. Ability to run frameworks other than MapReduce, such as MPI.
D. Reduce complexity of the MapReduce APIs.
E. Single point of failure in the NameNode.
F. Standardize on a single MapReduce API.
The most important consideration for slave nodes in a Hadoop cluster running production jobs that require short turnaround times is:
A. The ratio between the amount of memory and the number of disk drives.
B. The ratio between the amount of memory and the total storage capacity.
C. The ratio between the number of processor cores and the amount of memory.
D. The ratio between the number of processor cores and total storage capacity.
E. The ratio between the number of processor cores and number of disk drives.

The failure of which daemon makes HDFS unavailable on a cluster running MapReduce v1 (MRv1)?
A. Node Manager
B. Application Manager
C. Resource Manager
D. Secondary NameNode
E. NameNode
F. DataNode

What is the difference between a Hadoop database and a Relational Database?
Hadoop is not a database; it is an architecture with a filesystem called HDFS. The data is stored in HDFS, which does not have any predefined containers. A relational database stores data in predefined containers.
What is HDFS?
Stands for Hadoop Distributed File System. It uses a framework involving many machines which stores large amounts of data in files over a Hadoop cluster.

What is MapReduce?
Map Reduce is a set of programs used to access and manipulate large data sets over a Hadoop cluster.
What is the InputSplit in map reduce software?
An InputSplit is the slice of data to be processed by a single Mapper. It generally is of the block size which is stored on the datanode.
What is the meaning of replication factor?
Replication factor defines the number of times a given data block is stored in the cluster. The default replication factor is 3. This also means that you need to have 3 times the amount of storage needed to store the data. Each file is split into data blocks and spread across the cluster.
What is the default replication factor in HDFS?
Hadoop comes with a default replication factor of 3. You can set the replication level individually for each file in HDFS. In addition to fault tolerance, having replicas allows jobs that consume the same data to be run in parallel. Also, if there are replicas of the data, Hadoop can attempt to run multiple copies of the same task and take whichever finishes first. This is useful if for some reason a box is being slow.
Most Hadoop administrators set the default replication factor for their files to be three. The main assumption here is that if you keep three copies of the data, your data is safe. We have found this to be true in the big clusters that we manage and operate.
What is the typical block size of an HDFS block?
The default block size is 64 MB, but 128 MB is typical.

What is the NameNode?
The NameNode is one of the daemons that runs on the Master node and holds the meta info about where a particular chunk of data (i.e. which DataNode) resides. Based on the meta info, it maps the incoming job to the corresponding DataNode.
How does the master-slave architecture work in Hadoop?
The daemons run in Hadoop's master-slave architecture as follows:
On the Master node: NameNode, JobTracker and Secondary NameNode.
On the Slaves: DataNode and TaskTracker.
It is recommended to run the Secondary NameNode on a separate machine which has Master node capacity.
What are compute and storage nodes?
Hadoop can be defined in two ways: Distributed Processing (Map Reduce) and Distributed Storage (HDFS). The NameNode holds the meta info and the DataNodes hold the exact data and its MR program.
Explain the input and output data formats of the Hadoop framework.
FileInputFormat, TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat, SequenceFileAsTextInputFormat and WholeFileInputFormat are file formats in the Hadoop framework.
How can we control a particular key to go to a specific reducer?
By using a custom partitioner.
What is the Reducer used for?
The Reducer combines the multiple outputs of the mappers into one.
What are the primary phases of the Reducer?
The Reducer has 3 primary phases: shuffle, sort and reduce.
What happens if the number of reducers is 0?
It is legal to set the number of reduce-tasks to zero if no reduction is desired.


In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
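A minimal hedged sketch of the map-only case described above, using the newer org.apache.hadoop.mapreduce API (the identity Mapper and the paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only");
    job.setJarByClass(MapOnlyJob.class);
    job.setMapperClass(Mapper.class);  // the base Mapper just passes records through
    job.setNumReduceTasks(0);          // zero reducers: map output goes straight to HDFS, unsorted
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}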
How many instances of the JobTracker can run on a Hadoop cluster?
One. There can only be one JobTracker in the cluster. This can be run on the same machine running the NameNode.
How does the NameNode handle DataNode failures?
Each DataNode sends periodic heartbeats and block reports to the NameNode; when the heartbeats stop, the NameNode marks the DataNode as dead and re-replicates its blocks onto other DataNodes. Data corruption is detected separately through checksums: every data record is followed by a checksum, and if the checksum does not match the original, a data-corruption error is reported.

Can we set the number of reducers to zero?
Yes, the number of reducers can be given as zero. In that case the mapper output is the finalised output and is stored in HDFS.
What is SequenceFile in Hadoop?
A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous writable objects.
B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous writable objects.
C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.
Answer: D

Is there a map input format in Hadoop?
A. Yes, but only in Hadoop 0.22+.
B. Yes, there is a special format for map files.
C. No, but sequence file input format can read map files.
D. Both 2 and 3 are correct answers.
Answer: C

What happens if mapper output does not match reducer input in Hadoop?
A. Hadoop API will convert the data to the type that is needed by the reducer.
B. Data input/output inconsistency cannot occur. A preliminary validation check is executed
prior to the full execution of the job to ensure there is consistency.
C. The java compiler will report an error during compilation but the job will complete with
exceptions.
D. A real-time exception will be thrown and the map-reduce job will fail.
Answer: D

Can you provide multiple input paths to map-reduce jobs in Hadoop?
A. Yes, but only in Hadoop 0.22+.
B. No, Hadoop always operates on one input directory.
C. Yes, developers can add any number of input paths.
D. Yes, but the limit is currently capped at 10 input paths.
Answer: C
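A small hedged sketch of registering several input paths on one job (newer API; the paths are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MultiInputExample {
  public static Job configure() throws Exception {
    Job job = Job.getInstance();
    // Any number of input paths can be registered on the same job.
    FileInputFormat.addInputPath(job, new Path("/data/logs/2013"));
    FileInputFormat.addInputPath(job, new Path("/data/logs/2014"));
    FileInputFormat.addInputPaths(job, "/data/extra1,/data/extra2"); // comma-separated form
    return job;
  }
}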

Can a custom type for data Map-Reduce processing be implemented in Hadoop?
A. No, Hadoop does not provide techniques for custom datatypes.
B. Yes, but only for mappers.
C. Yes, custom data types can be implemented as long as they implement the writable interface.
D. Yes, but only for reducers.
Answer:
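For reference, a minimal hedged sketch of a custom value type implementing the Writable interface (the field layout is invented for the example; keys would implement WritableComparable instead, adding compareTo()):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Custom value type: write() and readFields() let Hadoop serialize and deserialize it.
public class PageView implements Writable {
  private long timestamp;
  private int statusCode;

  public void write(DataOutput out) throws IOException {
    out.writeLong(timestamp);
    out.writeInt(statusCode);
  }

  public void readFields(DataInput in) throws IOException {
    timestamp = in.readLong();
    statusCode = in.readInt();
  }
}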

The Hadoop API uses basic Java types such as LongWritable, Text, IntWritable. They have almost the same features as default java classes. What are these writable data types optimized for?
A. Writable data types are specifically optimized for network transmissions
B. Writable data types are specifically optimized for file system storage
C. Writable data types are specifically optimized for map-reduce processing
D. Writable data types are specifically optimized for data retrieval
Answer:

What is writable in Hadoop?
A. Writable is a java interface that needs to be implemented for streaming data to remote servers.
B. Writable is a java interface that needs to be implemented for HDFS writes.
C. Writable is a java interface that needs to be implemented for MapReduce processing.
D. None of these answers are correct.
Answer: C

What is the best performance one can expect from a Hadoop cluster?
A. The best performance expectation one can have is measured in seconds. This is because Hadoop can only be used for batch processing
B. The best performance expectation one can have is measured in milliseconds. This is because Hadoop executes in parallel across so many machines
C. The best performance expectation one can have is measured in minutes. This is because Hadoop can only be used for batch processing
D. It depends on the design of the map-reduce program, how many machines in the cluster, and the amount of data being retrieved
Answer: A

What is distributed cache in Hadoop?

A. The distributed cache is special component on namenode that will cache frequently used data for faster client response. It is used during reduce step.
B. The distributed cache is special component on datanode that will cache frequently used data for faster client response. It is used during map step.
C. The distributed cache is a component that caches java objects.
D. The distributed cache is a component that allows developers to deploy jars for Map-Reduce processing.
Answer:
Can you run Map Reduce jobs directly on Avro data in Hadoop?
A. Yes, Avro was specifically designed for data processing via Map-Reduce
B. Yes, but additional extensive coding is required
C. No, Avro was specifically designed for data storage only
D. Avro specifies metadata that allows easier data access. This data cannot be used as part of map-reduce execution, rather input specification only.
Answer:

What is AVRO in Hadoop?
A. Avro is a java serialization library
B. Avro is a java compression library
C. Avro is a java library that creates splittable files
D. None of these answers are correct
Answer: A

Will settings using Java API overwrite values in configuration files in Hadoop?
A.
No. The configuration settings in the configuration file takes precedence
B.
Yes.
The
configuration
settings
using
Java
API
take
precedence
C. It depends when the developer reads the configuration file. If it is read first then no.
D. Only global configuration settings are captured in configuration files on namenode.
There are only a very few job parameters that can be set using Java API.
Answer: B

Which is faster: Map-side join or Reduce-side join? Why?

A. Both techniques have about the same performance expectations.
B. Reduce-side join because the join operation is done on HDFS.
C. Map-side join is faster because the join operation is done in memory.
D. Reduce-side join because it is executed on the namenode which will have faster CPU and more memory.
Answer: C

What are the common problems with map-side join in Hadoop?

A. The most common problem with map-side joins is introducing a high level of code complexity. This complexity has several downsides: increased risk of bugs and performance degradation. Developers are cautioned to rarely use map-side joins.
B. The most common problem with map-side joins is lack of the available map slots since map-side joins require a lot of mappers.
C. The most common problems with map-side joins are out of memory exceptions on slave nodes.
D. The most common problem with map-side join is not clearly specifying the primary index in the join. This can lead to very slow performance on large datasets.
Answer: C

How can you overwrite the default input format in Hadoop?

A. In order to overwrite the default input format, the Hadoop administrator has to change default settings in the config file.
B. In order to overwrite the default input format, a developer has to set the new input format on the job config before submitting the job to a cluster.
C. The default input format is controlled by each individual mapper and each line needs to be parsed individually.
D. None of these answers are correct.
Answer: B

What is the default input format in Hadoop?

A. The default input format is xml. Developer can specify other input formats as appropriate if xml is not the correct input.
B. There is no default input format. The input format always should be specified.
C. The default input format is a sequence file format. The data needs to be preprocessed before using the default input format.
D. The default input format is TextInputFormat with byte offset as a key and entire line as a value.
Answer:
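Tying the two questions above together, a minimal hedged sketch (newer API; the choice of KeyValueTextInputFormat is just an example): if nothing is set, TextInputFormat is used, and a different format is selected on the job configuration before the job is submitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "custom input format");
    // Default (if nothing is set): TextInputFormat, byte offset key and whole line value.
    // Overriding it is done on the job configuration before submitting the job:
    job.setInputFormatClass(KeyValueTextInputFormat.class);
  }
}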

Why would a developer create a map-reduce job without the reduce step in Hadoop?
A. Developers should design Map-Reduce jobs without reducers only if no reduce slots are available on the cluster.
B. Developers should never design Map-Reduce jobs without reducers. An error will occur upon compile.
C. There is a CPU intensive step that occurs between the map and reduce steps. Disabling the reduce step speeds up data processing.
D. It is not possible to create a map-reduce job without at least one reduce step. A developer may decide to limit to one reducer for debugging purposes.
Answer: C

How can you disable the reduce step in Hadoop?

A. The Hadoop administrator has to set the number of reducer slots to zero on all slave nodes. This will disable the reduce step.
B. It is impossible to disable the reduce step since it is a critical part of the Map-Reduce abstraction.
C. A developer can always set the number of reducers to zero. That will completely disable the reduce step.
D. While you cannot completely disable reducers, you can set output to one. There needs to be at least one reduce step in the Map-Reduce abstraction.
Answer:

What is PIG in Hadoop?
A. Pig is a subset of the Hadoop API for data processing
B. Pig is a part of the Apache Hadoop project that provides a C-like scripting language interface for data processing
C. Pig is a part of the Apache Hadoop project. It is a "PL-SQL" interface for data processing in a Hadoop cluster
D. PIG is the third most popular form of meat in the US behind poultry and beef.
Answer: B

What is reduce side join in Hadoop?

A. Reduce-side join is a technique to eliminate data from the initial data set at the reduce step
B. Reduce-side join is a technique for merging data from different sources based on a specific key. There are no memory restrictions
C. Reduce-side join is a set of APIs to merge data from different sources.
D. None of these answers are correct
Answer: B

What is map side join in Hadoop?
A. Map-side join is done in the map phase and done in memory
B. Map-side join is a technique in which data is eliminated at the map step

C. Map-side join is a form of map-reduce API which joins data from different locations
D. None of these answers are correct
Answer: A
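A rough hedged sketch of the in-memory map-side join described above (newer API): the smaller table is loaded into a HashMap in setup() and each record is joined in map(). The cached file name "users.csv" and the CSV layout are assumptions; the file is expected to have been shipped to the tasks beforehand, for example via the distributed cache.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> smallTable = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    // "users.csv" is assumed to be available in the task's working directory
    // (e.g. distributed with the job via the distributed cache).
    BufferedReader in = new BufferedReader(new FileReader("users.csv"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] f = line.split(",", 2);
      smallTable.put(f[0], f[1]);          // join key -> rest of the record
    }
    in.close();
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] f = line.toString().split(",", 2);
    String match = smallTable.get(f[0]);   // join on the first column
    if (match != null) {
      context.write(new Text(f[0]), new Text(f[1] + "," + match));
    }
  }
}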
How can you use binary data in MapReduce in Hadoop?

A. Binary data can be used directly by a map-reduce job. Often binary data is added to a sequence file.
B. Binary data cannot be used by the Hadoop framework. Binary data should be converted to a Hadoop compatible format prior to loading.
C. Binary can be used in map-reduce only with very limited functionality. It cannot be used as a key for example.
D. Hadoop can freely use binary files with map-reduce jobs so long as the files have headers
Answer: A

What are map files and why are they important in Hadoop?

A. Map files are stored on the namenode and capture the metadata for all blocks on a particular rack. This is how Hadoop is "rack aware".
B. Map files are the files that show how the data is distributed in the Hadoop cluster.
C. Map files are generated by Map-Reduce after the reduce step. They show the task distribution during job execution.
D. Map files are sorted sequence files that also have an index. The index allows fast data look up.
Answer: D

What are sequence files and why are they important in Hadoop?

A. Sequence files are binary format files that are compressed and are splittable. They are often used in high-performance map-reduce jobs.
B. Sequence files are a type of file in the Hadoop framework that allow data to be sorted.
C. Sequence files are intermediate files that are created by Hadoop after the map step.
D. Both B and C are correct
Answer: A
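For reference, a small hedged sketch of writing a SequenceFile of Text/IntWritable pairs (the output path and the key/value choices are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/example.seq");  // placeholder output path
    // Every key in this file is a Text and every value an IntWritable.
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
    try {
      writer.append(new Text("alpha"), new IntWritable(1));
      writer.append(new Text("beta"), new IntWritable(2));
    } finally {
      writer.close();
    }
  }
}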
How many states does the Writable interface define in Hadoop?
A. Two
B. Four
C. Three
D. None of the above
Answer:

Which method of the FileSystem object is used for reading a file in HDFS in
Hadoop?
A. open()
B. access()
C. select()
D. None of the above
Answer:
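For reference, a minimal hedged sketch of reading an HDFS file through FileSystem.open() (the path is a placeholder):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(new Path("/user/hadoop/weblogs/part-00000")); // placeholder
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    } finally {
      reader.close();
    }
  }
}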

RPC means ______ in Hadoop?
A. Remote processing call
B. Remote process call
C. Remote procedure call
D. None of the above
Answer:
The switch given to the hadoop fs command for detailed help is ______.
A. -show
B. -help
C. -?
D. None of the above
Answer: B

The size of a block in HDFS in Hadoop is ______.
A. 512 bytes
B. 64 MB
C. 1024 KB
D. None of the above
Answer: B

Which MapReduce phase is theoretically able to utilize features of the underlying file system in order to optimize parallel execution in Hadoop?
A. Split
B. Map
C. Combine
Ans:

What is the input to the Reduce function in Hadoop?
A. One key and a list of all values associated with that key.
B. One key and a list of some values associated with that key.
C. An arbitrarily sized list of key/value pairs.
Ans: A

How can a distributed filesystem such as HDFS provide opportunities for optimization of a MapReduce operation?
A. Data represented in a distributed filesystem is already sorted.
B. Distributed filesystems must always be resident in memory, which is much faster than disk.
C. Data storage and processing can be co-located on the same node, so that most input data relevant to Map or Reduce will be present on local disks or cache.
D. A distributed filesystem makes random access faster because of the presence of a dedicated node serving file metadata.
Ans:

Which of the following MapReduce execution frameworks focus on execution in shared-memory environments?
A. Hadoop
B. Twister
C. Phoenix
Ans: C

What is the implementation language of the Hadoop MapReduce framework?
A. Java
B. C
C. FORTRAN
D. Python
Ans: A

The Combine stage, if present, must perform the same aggregation operation as Reduce?
A. True
B. False
Ans:

Which MapReduce stage serves as a barrier, where all previous stages must be completed before it may proceed?
A. Group (a.k.a. 'shuffle')
B. Combine
C. Reduce
D. Write
Ans:

Which TACC resource has support for Hadoop MapReduce?
A. Ranger
B. Longhorn
C. Lonestar
D. Spur
Ans: A
Which of the following scenarios makes HDFS unavailable in Hadoop?
A. JobTracker failure
B. TaskTracker failure
C. DataNode failure
D. NameNode failure
E. Secondary NameNode failure
Answer: D
Which TACC resource has support for Hadoop MapReduce in Hadoop?
A. Ranger
B. Longhorn
C. Lonestar
D. Spur
Ans: A

Which MapReduce stage serves as a barrier, where all previous stages must be completed before it may proceed in Hadoop?
A. Combine
B. Group (a.k.a. 'shuffle')
C. Reduce
D. Write
Ans:

Which of the following scenarios makes HDFS unavailable in Hadoop?
A. JobTracker failure
B. TaskTracker failure
C. DataNode failure
D. NameNode failure
E. Secondary NameNode failure
Answer: D

You are running a Hadoop cluster with all monitoring facilities properly configured. Which scenario will go undetected in Hadoop?
A. Map or reduce tasks that are stuck in an infinite loop.
B. HDFS is almost full.
C. The NameNode goes down.
D. A DataNode is disconnected from the cluster.
E. MapReduce jobs that are causing excessive memory swaps.
Answer:

Which of the following utilities allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?
A. Oozie
B. Sqoop
C. Flume
D. Hadoop Streaming
Answer: D

You need a distributed, scalable, data store that allows you random, realtime read/write access to hundreds of terabytes of data. Which of the following would you use in Hadoop?
A. Hue
B. Pig
C. Hive
D. Oozie
E. HBase
F. Flume
G. Sqoop
Answer: E

Workflows expressed in Oozie can contain:

A. Iterative repetition of MapReduce jobs until a desired answer or state is reached.
B. Sequences of MapReduce and Pig jobs. These are limited to linear sequences of actions with exception handlers but no forks.
C. Sequences of MapReduce jobs only; no Pig or Hive tasks or jobs. These MapReduce sequences can be combined with forks and path joins.
D. Sequences of MapReduce and Pig. These sequences can be combined with other actions including forks, decision points, and path joins.
Answer:

You have an employee who is a Data Analyst and is very comfortable with SQL. He would like to run ad-hoc analysis on data in your HDFS cluster. Which of the following is a data warehousing software built on top of Apache Hadoop that defines a simple SQL-like query language well-suited for this kind of user?
A. Pig
B. Hue
C. Hive
D. Sqoop
E. Oozie
F. Flume
G. Hadoop Streaming
Answer: C

Which of the following statements most accurately describes the relationship between MapReduce and Pig?
A. Pig provides additional capabilities that allow certain types of data manipulation not possible with MapReduce.
B. Pig provides no additional capabilities to MapReduce. Pig programs are executed as MapReduce jobs via the Pig interpreter.
C. Pig programs rely on MapReduce but are extensible, allowing developers to do special-purpose processing not provided by MapReduce.
D. Pig provides the additional capability of allowing you to control the flow of multiple MapReduce jobs.
Answer:

In a MapReduce job, you want each of you input files processed by a single map
task. How do you configure a MapReduce job so that a single map task processes
each input file regardless of how many blocks the input file occupies?
A. Increase the parameter that controls minimum split size in the job configuration.
B. Write a custom MapRunner that iterates over all key-value pairs in the entire file.
C. Set the number of mappers equal to the number of input files you want to process.
D. Write a custom FileInputFormat and override the method isSplittable to always return
false.
Answer: D
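A hedged sketch of option D (note that the Hadoop method is actually spelled isSplitable, with one 't'): a FileInputFormat subclass that refuses to split files, so each input file is handled by exactly one map task.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Extending TextInputFormat keeps the normal line-oriented RecordReader,
// but marking every file as non-splittable forces one split (one mapper) per file.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}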
Which of the following best describes the workings of TextInputFormat in Hadoop?
A. Input file splits may cross line breaks. A line that crosses file splits is ignored.
B. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
C. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
D. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.
E. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
Answer: E
