When is the earliest point at which the reduce method of a given Reducer can be
called?
A. As soon as at least one mapper has finished processing its input split.
B. As soon as a mapper has emitted at least one record.
C. Not until all mappers have finished processing all records.
D. It depends on the InputFormat used for the job.
Answer: C
Explanation:
A reducer's reduce() method cannot be called until every mapper has finished, because a reducer must first receive all of the intermediate values for its keys from the shuffle.
Which describes how a client reads a file from HDFS?
A. The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly off the DataNode(s).
B. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode.
C. The client contacts the NameNode for the block location(s). The NameNode then queries the DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode redirects the client to the DataNode that holds the requested data block(s). The client then reads the data directly off the DataNode.
D. The client contacts the NameNode for the block location(s). The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode, and then from the NameNode to the client.
Answer: A
You are developing a combiner that takes as input Text keys, IntWritable values,
and emits Text keys, IntWritable values. Which interface should your class
implement?
A. Combiner<Text, IntWritable, Text, IntWritable>
B. Mapper<Text, IntWritable, Text, IntWritable>
C. Reducer<Text, Text, IntWritable, IntWritable>
D. Reducer<Text, IntWritable, Text, IntWritable>
E. Combiner<Text, Text, IntWritable, IntWritable>
Answer: D
Explanation:
There is no separate Combiner interface; a combiner implements the Reducer interface, and both its input and output types must match the map output types: Reducer<Text, IntWritable, Text, IntWritable>.
Identify the utility that allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?
A. Oozie
B. Sqoop
C. Flume
D. Hadoop Streaming
E. mapred
Answer: D
Explanation:
Hadoop Streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
How are keys and values presented and passed to the reducers during a standard sort and shuffle phase of MapReduce?
A. Keys are presented to a reducer in sorted order; values for a given key are not sorted.
B. Keys are presented to a reducer in sorted order; values for a given key are sorted in ascending order.
C. Keys are presented to a reducer in random order; values for a given key are not sorted.
D. Keys are presented to a reducer in random order; values for a given key are sorted in ascending order.
Answer: A
Explanation:
The Reducer has 3 primary phases:
1. Shuffle: The Reducer copies the sorted output from each Mapper using HTTP across the network.
2. Sort: The framework merge-sorts Reducer inputs by key (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e. while outputs are being fetched they are merged.
Secondary Sort: To achieve a secondary sort on the values returned by the value iterator, the application should extend the key with the secondary key and define a grouping comparator. The keys will be sorted using the entire key, but will be grouped using the grouping comparator to decide which keys and values are sent in the same call to reduce.
3. Reduce: In this phase the reduce(Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs. The output of the reduce task is typically written to a RecordWriter via TaskInputOutputContext.write(Object, Object). The output of the Reducer is not re-sorted.
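The sort/shuffle contract above can be sketched in plain Java (this is an illustrative simulation, not Hadoop framework code): keys arrive at a reducer merge-sorted, while the values for a given key keep whatever order they arrived in.

```java
import java.util.*;

// Illustrative sketch (not Hadoop code) of the sort/shuffle contract:
// map outputs are grouped and sorted BY KEY before reduce() runs, but the
// order of VALUES within one key is not guaranteed (absent a secondary sort).
public class ShuffleSort {

    // Group a stream of (key, value) map outputs into key-sorted reducer input.
    public static SortedMap<String, List<Integer>> shuffleAndSort(
            List<Map.Entry<String, Integer>> mapOutput) {
        SortedMap<String, List<Integer>> reduceInput = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            reduceInput.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue()); // values keep arrival order: unsorted
        }
        return reduceInput;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("the", 3), Map.entry("fox", 1), Map.entry("the", 1));
        // Keys reach reducers sorted ("fox" before "the"); values for "the" stay [3, 1].
        System.out.println(shuffleAndSort(mapOutput));
    }
}
```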
Assuming default settings, which best describes the order of data provided to a reducer's reduce method:
A. The keys given to a reducer aren't in a predictable order, but the values associated with those keys always are.
B. Both the keys and values passed to a reducer always appear in sorted order.
C. Neither keys nor values are in any predictable order.
D. The keys given to a reducer are in sorted order but the values associated with each key are in no predictable order.
Answer: D
Explanation:
As above, the Reducer has 3 primary phases: shuffle (copying the sorted output from each Mapper over HTTP), sort (merge-sorting the inputs by key while they are being fetched), and reduce. Keys therefore arrive in sorted order, but the values for a given key are in no predictable order unless the application implements a secondary sort, extending the key with the secondary key and defining a grouping comparator. The output of the Reducer is not re-sorted.
You wrote a map function that throws a runtime exception when it encounters a control character in input data. The input supplied to your mapper contains twelve such characters in total, spread across five file splits. The first four file splits each have two control characters and the last split has four control characters.
Identify the number of failed task attempts you can expect when you run the job with mapred.max.map.attempts set to 4:
A. You will have forty-eight failed task attempts
B. You will have seventeen failed task attempts
C. You will have five failed task attempts
D. You will have twelve failed task attempts
E. You will have twenty failed task attempts
Answer: E
Explanation:
There will be four failed task attempts for each of the five file splits.
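The arithmetic behind that explanation can be written down directly (plain Java, not Hadoop code): every split containing at least one control character makes its map task throw, so that task burns all of its allowed attempts; the number of control characters per split is irrelevant.

```java
// Back-of-envelope check: with mapred.max.map.attempts = 4, every split whose
// task keeps throwing is retried until all 4 attempts have failed.
public class FailedAttempts {
    public static int failedAttempts(int failingSplits, int maxAttemptsPerTask) {
        // Each failing split burns all of its attempts; healthy splits contribute none.
        return failingSplits * maxAttemptsPerTask;
    }

    public static void main(String[] args) {
        // All 5 splits contain at least one control character, so all 5 tasks fail 4 times.
        System.out.println(failedAttempts(5, 4)); // prints 20
    }
}
```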
You want to populate an associative array in order to perform a map-side join. You've decided to put this information in a text file, place that file into the DistributedCache and read it in your Mapper before any records are processed. Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array.
A. combine
B. map
C. init
D. configure
Answer: D
Explanation:
See 3) below. Here is an illustrative example on how to use the DistributedCache:
// Setting up the cache for the application
1. Copy the requisite files to the FileSystem:
$ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
$ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
$ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
$ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
$ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
2. Set up the application's JobConf:
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
3. Use the cached files in the Mapper or Reducer:
public static class MapClass extends MapReduceBase
    implements Mapper<K, V, K, V> {

  private Path[] localArchives;
  private Path[] localFiles;

  public void configure(JobConf job) {
    // Get the cached archives/files
    localArchives = DistributedCache.getLocalCacheArchives(job);
    localFiles = DistributedCache.getLocalCacheFiles(job);
  }

  public void map(K key, V value,
                  OutputCollector<K, V> output, Reporter reporter)
      throws IOException {
    // Use data from the cached archives/files here
    // ...
    output.collect(k, v);
  }
}
You've written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which interface is most likely to reduce the amount of intermediate data transferred across the network?
A. Partitioner
B. OutputFormat
C. WritableComparable
D. Writable
E. InputFormat
F. Combiner
Answer: F
Explanation:
Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally, on the individual mapper nodes. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative.
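The effect of a combiner can be sketched in plain Java (an illustrative simulation, not the Hadoop Combiner API): summing counts per key on the map side shrinks the number of records that cross the network.

```java
import java.util.*;

// Sketch of what a combiner buys you: local, map-side aggregation of a
// commutative and associative operation (here, summing counts per key).
public class CombinerEffect {

    public static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new HashMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            combined.merge(kv.getKey(), kv.getValue(), Integer::sum); // commutative + associative
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("the", 1), Map.entry("the", 1), Map.entry("fox", 1));
        // Three intermediate records collapse to two before being shuffled.
        System.out.println(combine(mapOutput));
    }
}
```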
Can you use MapReduce to perform a relational join on two large tables sharing a key? Assume that the two tables are formatted as comma-separated files in HDFS.
A. Yes.
B. Yes, but only if one of the tables fits into memory.
C. Yes, so long as both tables fit into memory.
D. No, MapReduce cannot perform relational operations.
E. No, but it can be done with either Pig or Hive.
Answer: A
Explanation:
Join algorithms in MapReduce:
Reduce-side join
Map-side join
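The reduce-side join named above can be sketched in plain Java (an illustrative simulation, not Hadoop code; the table names and fields are invented): the map step tags each record with its source, the shuffle groups both tables' records by the join key, and the reduce step pairs them up.

```java
import java.util.*;

// Minimal reduce-side join sketch. Neither table needs to fit in memory in
// real Hadoop, because grouping by key happens in the framework's shuffle.
public class ReduceSideJoin {

    // Tagged record: which source table it came from, plus its payload.
    record Tagged(String source, String payload) {}

    // "Map": emit (joinKey, tagged record) for every row of both inputs.
    // "Shuffle": group by key. "Reduce": cross-join the two tag groups per key.
    public static List<String> join(List<String[]> users, List<String[]> orders) {
        Map<String, List<Tagged>> byKey = new TreeMap<>();
        for (String[] u : users)
            byKey.computeIfAbsent(u[0], k -> new ArrayList<>()).add(new Tagged("U", u[1]));
        for (String[] o : orders)
            byKey.computeIfAbsent(o[0], k -> new ArrayList<>()).add(new Tagged("O", o[1]));
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> group : byKey.entrySet())
            for (Tagged a : group.getValue())
                if (a.source().equals("U"))
                    for (Tagged b : group.getValue())
                        if (b.source().equals("O"))
                            out.add(group.getKey() + "," + a.payload() + "," + b.payload());
        return out;
    }
}
```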
You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper's map method?
A. Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk.
B. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
C. Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
D. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer.
E. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.
Answer: C
You want to understand more about how users browse your public website, such as which pages they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How will you gather this data for your analysis?
A. Ingest the server web logs into HDFS using Flume.
B. Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for reducers.
C. Import all users' clicks from your OLTP databases into Hadoop, using Sqoop.
D. Channel these clickstreams into Hadoop using Hadoop Streaming.
E. Sample the weblogs from the web servers, copying them into Hadoop using curl.
Answer: A
MapReduce v2 (MRv2/YARN) is designed to address which two issues?
A. Single point of failure in the NameNode.
B. Resource pressure on the JobTracker.
C. HDFS latency.
D. Ability to run frameworks other than MapReduce, such as MPI.
E. Reduce complexity of the MapReduce APIs.
F. Standardize on a single MapReduce API.
Answer: B, D
You need to run the same job many times with minor variations. Rather than hardcoding all job configuration options in your driver code, you've decided to have your Driver subclass org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface. Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop?
A. hadoop "mapred.job.name=Example" MyDriver input output
B. hadoop MyDriver mapred.job.name=Example input output
C. hadoop MyDriver -D mapred.job.name=Example input output
D. hadoop setproperty mapred.job.name=Example MyDriver input output
E. hadoop setproperty ("mapred.job.name=Example") MyDriver input output
Answer: C
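The reason the -D form works is that ToolRunner runs the command line through a generic-options parser, which consumes "-D name=value" pairs into the job Configuration before the driver's run() ever sees the remaining arguments. Below is a simplified stand-in parser in plain Java to illustrate the mechanism; it is not the real GenericOptionsParser.

```java
import java.util.*;

// Simplified stand-in for generic-option handling: pull "-D name=value" pairs
// out of the argument list, leaving the rest (e.g. input/output paths) alone.
public class GenericOptions {

    public static Map<String, String> parseD(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D") && i + 1 < args.length) {
                String[] kv = args[++i].split("=", 2); // "name=value"
                props.put(kv[0], kv[1]);
            }
        }
        return props;
    }

    public static void main(String[] args) {
        System.out.println(parseD(
                new String[]{"-D", "mapred.job.name=Example", "input", "output"}));
    }
}
```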
You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values representing product identifiers (Text). Identify what determines the data types used by the Mapper for a given job.
A. The key and value types specified in the JobConf.setMapInputKeyClass and JobConf.setMapInputValuesClass methods.
B. The data types specified in the HADOOP_MAP_DATATYPES environment variable.
C. The mapper-specification.xml file submitted with the job determines the mapper's input key and value types.
D. The InputFormat used by the job determines the mapper's input key and value types.
Answer: D
For each input key-value pair, mappers can emit:
A. As many intermediate key-value pairs as designed. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
B. As many intermediate key-value pairs as designed, but they cannot be of the same type as the input key-value pair.
C. One intermediate key-value pair, of a different type.
D. One intermediate key-value pair, but of the same type.
E. As many intermediate key-value pairs as designed, as long as all the keys have the same type and all the values have the same type.
Answer: E
You have the following key-value pairs as output from your Map task:
(the, 1) (fox, 1) (faster, 1) (than, 1) (the, 1) (dog, 1)
How many keys will be passed to the Reducer's reduce method?
A. Six
B. Five
C. Four
D. Two
E. One
F. Three
Answer: B
You have user profile records in your OLTP database that you want to join with web logs you have already ingested into the Hadoop file system. How will you obtain these user records?
A. HDFS command
B. Pig LOAD command
C. Sqoop import
D. Hive LOAD DATA command
E. Ingest with Flume agents
F. Ingest with Hadoop Streaming
Answer: C
Which two updates occur when a client application opens a stream to begin a file write on a cluster running MapReduce v1 (MRv1)?
A. Once the write stream closes on the DataNode, the DataNode immediately initiates a block report to the NameNode.
B. The change is written to the NameNode disk.
C. The metadata in the RAM on the NameNode is flushed to disk.
D. The metadata in RAM on the NameNode is flushed to disk.
E. The metadata in RAM on the NameNode is updated.
F. The change is written to the edits file.
Answer: E, F
Which best describes the relationship between tasks and task attempts?
A. There are always at least as many task attempts as there are tasks.
B. There are always at most as many task attempts as there are tasks.
C. There are always exactly as many task attempts as there are tasks.
D. The developer sets the number of task attempts on job submission.
Answer: A
What happens when the NameNode determines that a DataNode is dead?
A. The NameNode forces re-replication of all the blocks which were stored on the dead DataNode.
B. The next time a client submits a job that requires blocks from the dead DataNode, the JobTracker receives no heartbeats from the DataNode. The JobTracker tells the NameNode that the DataNode is dead, which triggers block re-replication on the cluster.
C. The replication factor of the files which had blocks stored on the dead DataNode is temporarily reduced, until the dead DataNode is recovered and returned to the cluster.
D. The NameNode informs the client which wrote the blocks that they are no longer available; the client then re-writes the blocks to a different DataNode.
Answer: A
How does the NameNode know DataNodes are available on a cluster running MapReduce v1 (MRv1)?
A. DataNodes are listed in the dfs.hosts file. The NameNode uses this as the definitive list of available DataNodes.
B. DataNodes heartbeat in to the master on a regular basis.
C. The NameNode broadcasts a heartbeat on the network on a regular basis, and DataNodes respond.
D. The NameNode sends a broadcast across the network when it first starts, and DataNodes respond.
Answer: B
Which three distcp features can you utilize on a Hadoop cluster?
A. Use distcp to copy files only between two clusters or more. You cannot use distcp to copy data between directories inside the same cluster.
B. Use distcp to copy HBase table files.
C. Use distcp to copy physical blocks from the source to the target destination in your cluster.
D. Use distcp to copy data between directories inside the same cluster.
E. Use distcp to run an internal MapReduce job to copy files.
Answer: B, D, E
How does HDFS Federation help HDFS scale horizontally?
A. HDFS Federation improves the resiliency of HDFS in the face of network issues by removing the NameNode as a single point of failure.
B. HDFS Federation allows the Standby NameNode to automatically resume the services of an active NameNode.
C. HDFS Federation provides cross-data-center (non-local) support for HDFS, allowing a cluster administrator to split the Block Storage outside the local cluster.
D. HDFS Federation reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the filesystem namespace.
Answer: D
Choose which best describes a Hadoop cluster's block size storage parameters once you set the HDFS default block size to 64MB.
A. The block size of files in the cluster can be determined as the block is written.
B. The block size of files in the cluster will all be multiples of 64MB.
C. The block size of files in the cluster will all be at least 64MB.
D. The block size of files in the cluster will all be exactly 64MB.
Answer: A
Which MapReduce daemon instantiates user code, and executes map and reduce tasks on a cluster running MapReduce v1 (MRv1)?
A. NameNode
B. DataNode
C. JobTracker
D. TaskTracker
E. ResourceManager
F. ApplicationMaster
G. NodeManager
Answer: D
What two processes must you do if you are running a Hadoop cluster with a single NameNode and six DataNodes, and you want to change a configuration parameter so that it affects all six DataNodes?
A. You must restart the NameNode daemon to apply the changes to the cluster.
B. You must restart all six DataNode daemons to apply the changes to the cluster.
C. You don't need to restart any daemon, as they will pick up changes automatically.
D. You must modify the configuration files on each of the six DataNode machines.
E. You must modify the configuration files on only one of the DataNode machines.
F. You must modify the configuration files on the NameNode only. DataNodes read their configuration from the master nodes.
Answer: B, D
Identify the function performed by the Secondary NameNode daemon on a cluster configured to run with a single NameNode.
A. In this configuration, the Secondary NameNode performs a checkpoint operation on the files used by the NameNode.
B. In this configuration, the Secondary NameNode is a standby NameNode, ready to failover and provide high availability.
C. In this configuration, the Secondary NameNode performs real-time backups of the NameNode.
D. In this configuration, the Secondary NameNode serves as an alternate data channel for clients to reach HDFS, should the NameNode become too busy.
Answer: A
You install Cloudera Manager on a cluster where each host has 1 GB of RAM. All of the services show their status as concerning. However, all jobs submitted complete without an error. Why is Cloudera Manager showing the concerning status for the services?
A. A slave node's disk ran out of space.
B. The slave nodes haven't sent a heartbeat in 60 minutes.
C. The slave nodes are swapping.
D. A DataNode service instance has crashed.
Answer: C
What is the recommended disk configuration for slave nodes in your Hadoop cluster with 6 x 2 TB hard drives?
A. RAID 10
B. JBOD
C. RAID 5
D. RAID 1+0
Answer: B
You configure your cluster with HDFS High Availability (HA) using Quorum-based storage. You do not implement HDFS Federation. What is the maximum number of NameNode daemons you should run on your cluster in order to avoid a "split-brain" scenario with your NameNodes?
A. Unlimited. HDFS High Availability (HA) is designed to overcome limitations on the number of NameNodes you can deploy.
B. Two active NameNodes and one Standby NameNode.
C. One active NameNode and one Standby NameNode.
D. Two active NameNodes and two Standby NameNodes.
Answer: C
You configure a Hadoop cluster with both MapReduce frameworks, MapReduce v1 (MRv1) and MapReduce v2 (MRv2/YARN). Which two MapReduce (computational) daemons do you need to configure to run on your master nodes?
A. JobTracker
B. ResourceManager
C. ApplicationMaster
D. JournalNode
E. NodeManager
Answer: A, B
You observe that the number of spilled records from map tasks far exceeds the number of map output records. Your child heap size is 1 GB and your io.sort.mb value is set to 100 MB. How would you tune your io.sort.mb value to achieve maximum memory-to-disk I/O ratio?
A. Tune the io.sort.mb value until you observe that the number of spilled records equals (or is as close as possible to) the number of map output records.
B. Decrease the io.sort.mb value below 100 MB.
C. Increase the io.sort.mb value as high as you can, as close to 1 GB as possible.
D. For a 1 GB child heap size, an io.sort.mb of 128 MB will always maximize memory-to-disk I/O.
Answer: A
Your Hadoop cluster has 25 nodes with a total of 100 TB (4 TB per node) of raw disk space allocated to HDFS storage. Assuming Hadoop's default configuration, how much data will you be able to store?
A. Approximately 100 TB
B. Approximately 25 TB
C. Approximately 10 TB
D. Approximately 33 TB
Answer: D
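The sizing arithmetic behind this kind of question is just raw capacity divided by the default replication factor of 3 (plain Java, not Hadoop code):

```java
// Usable HDFS capacity under replication: every block is stored
// replicationFactor times, so usable space is raw space divided by it.
public class HdfsCapacity {

    public static double usableTB(double rawTB, int replicationFactor) {
        return rawTB / replicationFactor;
    }

    public static void main(String[] args) {
        System.out.println(usableTB(100.0, 3)); // approximately 33.3 TB
    }
}
```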
You set up the Hadoop cluster using NameNode Federation. One NameNode manages the /users namespace and one NameNode manages the /data namespace.

What is MapReduce?
MapReduce is a set of programs used to access and manipulate large data sets over a Hadoop cluster.
What is the InputSplit in MapReduce software?
An InputSplit is the chunk of input data processed by a single mapper; the job's InputFormat divides the input files into these logical splits.

What is the meaning of replication factor?
Replication factor defines the number of times a given data block is stored in the cluster. The default replication factor is 3. This also means that you need to have 3 times the amount of storage needed to store the data. Each file is split into data blocks and spread across the cluster.
What is the default replication factor in HDFS?
Hadoop comes with a default replication factor of 3. You can set the replication level individually for each file in HDFS. In addition to fault tolerance, having replicas allows jobs that consume the same data to be run in parallel. Also, if there are replicas of the data, Hadoop can attempt to run multiple copies of the same task and take whichever finishes first. This is useful if for some reason a box is being slow.
Most Hadoop administrators set the default replication factor for their files to be three. The main assumption here is that if you keep three copies of the data, your data is safe. We have found this to be true in the big clusters that we manage and operate.
What is the default block size of an HDFS block?
The default block size is 64 MB, but 128 MB is typical.
What is the NameNode?
The NameNode is one of the daemons that runs on the master node. It holds the metadata recording where each particular chunk of data (i.e., which DataNode) resides, and based on that metadata it maps incoming jobs to the corresponding DataNodes.
How do the master and slave daemons run in the Hadoop architecture?
Hadoop is based entirely on a master-slave architecture.
On the master node: NameNode, JobTracker and Secondary NameNode.
On the slaves: DataNode and TaskTracker.
But it is recommended to run the Secondary NameNode on a separate machine which has master-node capacity.
What is the Hadoop framework, and how do the compute and storage nodes fit in?
Hadoop can be defined in two ways:
Distributed Processing: MapReduce
Distributed Storage: HDFS
The NameNode holds the metadata, while the DataNodes hold the actual data and run the MR programs.

Explain the input and output data formats of the Hadoop framework.
FileInputFormat, TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat, SequenceFileAsTextInputFormat and WholeFileInputFormat are file formats in the Hadoop framework.
How can we control a particular key to go to a specific reducer?
By using a custom partitioner.

What is the Reducer used for?
The Reducer is used to combine the multiple outputs of the mappers into one.

What are the primary phases of the Reducer?
The Reducer has 3 primary phases: shuffle, sort and reduce.

What happens if the number of reducers is 0?
The reduce phase is skipped: the mapper output is the final output and is stored directly in HDFS.

How many instances of JobTracker can run on a Hadoop cluster?
One. There can only be one JobTracker in the cluster. This can be run on the same machine running the NameNode.
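The custom-partitioner idea mentioned above can be sketched in plain Java (an illustrative simulation, not the Hadoop Partitioner API): by default a key is routed to reducer (hash & Integer.MAX_VALUE) % numReducers, and a custom partitioner replaces exactly this rule to pin chosen keys to chosen reducers.

```java
// Sketch of the default hash-partitioning rule that a custom Partitioner
// overrides. The mask keeps the hash non-negative before the modulo.
public class KeyPartitioner {

    public static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Every key maps deterministically to one of the numReducers buckets.
        System.out.println(partition("the", 4));
    }
}
```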
How does the NameNode handle DataNode failures?
Each DataNode sends periodic heartbeats to the NameNode. When the NameNode stops receiving heartbeats from a DataNode, it marks that node as dead and re-replicates its blocks onto other DataNodes. Separately, data corruption is detected through checksums: every data record is followed by a checksum, and if a checksum does not match the original, a data-corrupted error is reported.
Can we set the number of reducers to zero?
Yes, the number of reducers can be given as zero. In that case the mapper output is the final output and is stored directly in HDFS.
What is a SequenceFile in Hadoop?
A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous Writable objects.
B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous Writable objects.
C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.
Answer: D

Is there a map input format in Hadoop?
A. Yes, but only in Hadoop 0.22+.
B. Yes, there is a special format for map files.
C. No, but sequence file input format can read map files.
D. Both 2 and 3 are correct answers.
Answer: C
What happens if mapper output does not match reducer input in Hadoop?
A. Hadoop API will convert the data to the type that is needed by the reducer.
B. Data input/output inconsistency cannot occur. A preliminary validation check is executed
prior to the full execution of the job to ensure there is consistency.
C. The java compiler will report an error during compilation but the job will complete with
exceptions.
D. A real-time exception will be thrown and the map-reduce job will fail.
Answer: D

Can you provide multiple input paths to map-reduce jobs in Hadoop?
A. Yes, but only in Hadoop 0.22+.
B. No, Hadoop always operates on one input directory.
C. Yes, developers can add any number of input paths.
D. Yes, but the limit is currently capped at 10 input paths.
Answer: C
Can you use custom data types in MapReduce in Hadoop?
A. No, Hadoop does not provide techniques for custom datatypes.
B. Yes, but only for mappers.
C. Yes, custom data types can be implemented as long as they implement the Writable interface.
D. Yes, but only for reducers.
Answer: C
The Hadoop API uses basic Java types such as LongWritable, Text, IntWritable. They have almost the same features as default Java classes. What are these writable data types optimized for?
A. Writable data types are specifically optimized for network transmissions.
B. Writable data types are specifically optimized for file system storage.
C. Writable data types are specifically optimized for map-reduce processing.
D. Writable data types are specifically optimized for data retrieval.
Answer: A
What is writable in Hadoop?
A. Writable is a java interface that needs to be implemented for streaming data to remote servers.
B. Writable is a java interface that needs to be implemented for HDFS writes.
C. Writable is a java interface that needs to be implemented for MapReduce processing.
D. None of these answers are correct.
Answer: C

What is the best performance one can expect from a Hadoop cluster?
A. The best performance expectation one can have is measured in seconds. This is because Hadoop can only be used for batch processing.
B. The best performance expectation one can have is measured in milliseconds. This is because Hadoop executes in parallel across so many machines.
C. The best performance expectation one can have is measured in minutes. This is because Hadoop can only be used for batch processing.
D. It depends on the design of the map-reduce program, how many machines are in the cluster, and the amount of data being retrieved.
Answer: A

What is distributed cache in Hadoop?
A. The distributed cache is a special component on the namenode that will cache frequently used data for faster client response. It is used during the reduce step.
B. The distributed cache is a special component on the datanode that will cache frequently used data for faster client response. It is used during the map step.
C. The distributed cache is a component that caches java objects.
D. The distributed cache is a component that allows developers to deploy jars for MapReduce processing.
Answer: D
Can you run MapReduce jobs directly on Avro data in Hadoop?
A. Yes, Avro was specifically designed for data processing via Map-Reduce.
B. Yes, but additional extensive coding is required.
C. No, Avro was specifically designed for data storage only.
D. Avro specifies metadata that allows easier data access. This data cannot be used as part of map-reduce execution, rather input specification only.
Answer: A
What is AVRO in Hadoop?
A. Avro is a java serialization library.
B. Avro is a java compression library.
C. Avro is a java library that creates splittable files.
D. None of these answers are correct.
Answer: A
Will settings made using the Java API overwrite values in configuration files in Hadoop?
A. No. The configuration settings in the configuration file take precedence.
B. Yes. The configuration settings made using the Java API take precedence.
C. It depends on when the developer reads the configuration file. If it is read first, then no.
D. Only global configuration settings are captured in configuration files on the namenode. There are only a very few job parameters that can be set using the Java API.
Answer: B

Which is faster: Map-side join or Reduce-side join? Why?
A. Both techniques have about the same performance expectations.
B. Reduce-side join, because the join operation is done on HDFS.
C. Map-side join is faster because the join operation is done in memory.
D. Reduce-side join, because it is executed on the namenode, which will have a faster CPU and more memory.
Answer: C
What are the common problems with map-side join in Hadoop?
A. The most common problem with map-side joins is introducing a high level of code complexity. This complexity has several downsides: increased risk of bugs and performance degradation. Developers are cautioned to rarely use map-side joins.
B. The most common problem with map-side joins is a lack of available map slots, since map-side joins require a lot of mappers.
C. The most common problems with map-side joins are out-of-memory exceptions on slave nodes.
D. The most common problem with map-side joins is not clearly specifying the primary index in the join. This can lead to very slow performance on large datasets.
Answer: C

How can you overwrite the default input format in Hadoop?
A. In order to overwrite the default input format, the Hadoop administrator has to change the default settings in the config file.
B. In order to overwrite the default input format, a developer has to set the new input format on the job config before submitting the job to the cluster.
C. The default input format is controlled by each individual mapper, and each line needs to be parsed individually.
D. None of these answers are correct.
Answer: B

What is the default input format in Hadoop?
A. The default input format is xml. The developer can specify other input formats as appropriate if xml is not the correct input.
B. There is no default input format. The input format always should be specified.
C. The default input format is a sequence file format. The data needs to be preprocessed before using the default input format.
D. The default input format is TextInputFormat with byte offset as a key and entire line as a value.
Answer: D
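What "byte offset as a key and entire line as a value" means can be sketched in plain Java (an illustrative simulation, not the real TextInputFormat): each line of the input becomes a record whose key is the byte offset of the line's first byte.

```java
import java.util.*;

// Sketch of TextInputFormat-style records: (byte offset of line start, line text).
// Assumes '\n' delimiters and single-byte characters for simplicity.
public class TextRecords {

    public static List<Map.Entry<Long, String>> records(String file) {
        List<Map.Entry<Long, String>> recs = new ArrayList<>();
        long offset = 0;
        for (String line : file.split("\n", -1)) {
            recs.add(Map.entry(offset, line));
            offset += line.getBytes().length + 1; // +1 for the newline delimiter
        }
        return recs;
    }

    public static void main(String[] args) {
        // "foo" starts at byte 0, "bar" starts at byte 4 (after "foo\n").
        System.out.println(records("foo\nbar"));
    }
}
```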
Why would a developer create a map-reduce job without the reduce step in Hadoop?
A. Developers should design Map-Reduce jobs without reducers only if no reduce slots are available on the cluster.
B. Developers should never design Map-Reduce jobs without reducers. An error will occur upon compile.
C. There is a CPU-intensive step that occurs between the map and reduce steps. Disabling the reduce step speeds up data processing.
D. It is not possible to create a map-reduce job without at least one reduce step. A developer may decide to limit to one reducer for debugging purposes.
Answer: C

How can you disable the reduce step in Hadoop?
A. The Hadoop administrator has to set the number of reducer slots to zero on all slave nodes. This will disable the reduce step.
B. It is impossible to disable the reduce step, since it is a critical part of the Map-Reduce abstraction.
C. A developer can always set the number of reducers to zero. That will completely disable the reduce step.
D. While you cannot completely disable reducers, you can set output to one. There needs to be at least one reduce step in the Map-Reduce abstraction.
Answer: C
What is Pig in Hadoop?
A. Pig is a subset of the Hadoop API for data processing.
B. Pig is a part of the Apache Hadoop project that provides a C-like scripting language interface for data processing.
C. Pig is a part of the Apache Hadoop project. It is a "PL-SQL" interface for data processing in a Hadoop cluster.
D. Pig is the third most popular form of meat in the US, behind poultry and beef.
Answer: B
What is reduce-side join in Hadoop?
A. Reduce-side join is a technique to eliminate data from the initial data set at the reduce step.
B. Reduce-side join is a technique for merging data from different sources based on a specific key. There are no memory restrictions.
C. Reduce-side join is a set of APIs to merge data from different sources.
D. None of these answers are correct.
Answer: B

What is map-side join in Hadoop?
Map-side
join
is
done
in
the
map
phase
and
done
in
memory
Map-side join is a technique in which data is eliminated at the map step
C. Map-side join is a form of map-reduce API which joins data from different locations
D.
None
of
these
answers
are
correct
Answer:
How
A
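A map-side join, by contrast, loads the smaller data set into each mapper's memory (in real jobs typically via the distributed cache) and joins during the map phase, avoiding the shuffle entirely. A sketch (plain Python, illustrative only):

```python
# Map-side join sketch: the smaller data set is held in memory as a hash
# table and each "map" record is joined against it -- no shuffle needed.
def map_side_join(small, big):
    lookup = dict(small)            # the small side must fit in memory
    out = []
    for k, v in big:
        if k in lookup:             # the join happens in the map phase
            out.append((k, lookup[k], v))
    return out

users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (3, "mug")]
print(map_side_join(users, orders))
```

The trade-off mirrors the two quiz answers: map-side joins are faster (no shuffle) but constrained by memory; reduce-side joins have no memory restriction but pay for a full shuffle.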
How can you use binary data in MapReduce in Hadoop?
A. Binary data can be used directly by a map-reduce job. Often binary data is added to a sequence file.
B. Binary data cannot be used by the Hadoop framework. Binary data should be converted to a Hadoop-compatible format prior to loading.
C. Binary data can be used in map-reduce only with very limited functionality. It cannot be used as a key, for example.
D. Hadoop can freely use binary files with map-reduce jobs so long as the files have headers
Answer: A
What are map files and why are they important in Hadoop?
A. Map files are stored on the NameNode and capture the metadata for all blocks on a particular rack. This is how Hadoop is "rack aware"
B. Map files are the files that show how the data is distributed in the Hadoop cluster.
C. Map files are generated by Map-Reduce after the reduce step. They show the task distribution during job execution
D. Map files are sorted sequence files that also have an index. The index allows fast data look-up.
Answer: D
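Answer D can be made concrete: because a MapFile's data is sorted by key, a sparse index (every Nth key mapped to its position) lets a reader jump close to the target and scan only a short stretch. TinyMapFile below is a hypothetical toy class, not Hadoop's org.apache.hadoop.io.MapFile, but it shows the same index-then-scan lookup:

```python
import bisect

# Toy model of a MapFile: sorted key/value data plus a sparse index that
# records every index_interval-th key and its position in the data.
class TinyMapFile:
    def __init__(self, pairs, index_interval=2):
        self.data = sorted(pairs)                    # sorted "sequence file"
        self.index = [(self.data[i][0], i)
                      for i in range(0, len(self.data), index_interval)]

    def get(self, key):
        keys = [k for k, _ in self.index]
        i = bisect.bisect_right(keys, key) - 1       # last index entry <= key
        if i < 0:
            return None
        _, start = self.index[i]
        for k, v in self.data[start:]:               # short scan from there
            if k == key:
                return v
            if k > key:                              # sorted, so we can stop
                break
        return None

mf = TinyMapFile([("c", 3), ("a", 1), ("b", 2), ("d", 4)])
print(mf.get("b"), mf.get("z"))
```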
What are sequence files and why are they important in Hadoop?
A. Sequence files are binary format files that are compressed and are splittable. They are often used in high-performance map-reduce jobs
B. Sequence files are a type of file in the Hadoop framework that allow data to be sorted
C. Sequence files are intermediate files that are created by Hadoop after the map step
D. Both B and C are correct
Answer: A
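The key idea behind binary key/value files can be sketched with a length-prefixed record format. This is NOT the real Hadoop SequenceFile layout (which also carries a header, sync markers for splittability, and optional compression); it only shows why self-describing binary records can hold arbitrary bytes without escaping:

```python
import struct
import io

# Illustrative length-prefixed binary records: each record stores the key
# and value lengths first, so arbitrary binary payloads round-trip safely.
def write_records(buf, pairs):
    for key, val in pairs:
        buf.write(struct.pack(">II", len(key), len(val)))  # lengths first
        buf.write(key)
        buf.write(val)

def read_records(buf):
    out = []
    while True:
        header = buf.read(8)
        if len(header) < 8:                                # end of stream
            break
        klen, vlen = struct.unpack(">II", header)
        out.append((buf.read(klen), buf.read(vlen)))
    return out

buf = io.BytesIO()
write_records(buf, [(b"k1", b"\x00\xffbinary"), (b"k2", b"v2")])
buf.seek(0)
print(read_records(buf))
```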
How many methods does the Writable interface define in Hadoop?
A. Two
B. Four
C. Three
D. None of the above
Answer:
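The two methods in the real org.apache.hadoop.io.Writable interface are write(DataOutput out) and readFields(DataInput in): a type serializes itself and re-populates itself from a stream. A Python analogue (a sketch, not Hadoop code):

```python
import struct
import io

# Python analogue of Hadoop's Writable contract: the type knows how to
# serialize itself (write) and re-populate itself from a stream (read_fields).
class IntPairWritable:
    def __init__(self, a=0, b=0):
        self.a, self.b = a, b

    def write(self, out):            # like Writable.write(DataOutput)
        out.write(struct.pack(">ii", self.a, self.b))

    def read_fields(self, inp):      # like Writable.readFields(DataInput)
        self.a, self.b = struct.unpack(">ii", inp.read(8))

buf = io.BytesIO()
IntPairWritable(3, -7).write(buf)
buf.seek(0)
p = IntPairWritable()
p.read_fields(buf)
print(p.a, p.b)
```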
Which method of the FileSystem object is used for reading a file in HDFS in Hadoop?
A. open()
B. access()
C. select()
D. None of the above
Answer:
RPC means ______ in Hadoop.
A. Remote processing call
B. Remote process call
C. Remote procedure call
D. None of the above
Answer:
The switch given to the "hadoop fs" command for detailed help is
A. -show
B. -help
C. -?
D. None of the above
Answer: B
The size of a block in HDFS in Hadoop is
A. 512 bytes
B. 64 MB
C. 1024 KB
D. None of the above
Answer: B
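The arithmetic behind the block-size answer: with the classic 64 MB default (newer Hadoop releases default to 128 MB), a file occupies ceil(size / 64 MB) blocks, and the last block may be smaller than the block size:

```python
import math

# Blocks occupied by a file under a 64 MB HDFS block size (the old default;
# modern Hadoop defaults to 128 MB).
BLOCK = 64 * 1024 * 1024

def num_blocks(file_bytes):
    return math.ceil(file_bytes / BLOCK)

print(num_blocks(1024 * 1024 * 1024))  # 1 GB  -> 16 blocks
print(num_blocks(65 * 1024 * 1024))    # 65 MB -> 2 blocks (second is partial)
```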
A. Split
B. Map
C. Combine
Ans:
What is the input to the Reduce function in Hadoop?
A. One key and a list of all values associated with that key.
B. One key and a list of some values associated with that key.
C. An arbitrarily sized list of key/value pairs.
Ans: A
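Answer A can be simulated directly: the shuffle sorts map output by key, and each reduce call then receives one key together with the full list of values emitted for that key:

```python
from itertools import groupby

# Simulated shuffle: sort map output by key, then group so that each
# reduce() invocation would see one key and ALL of that key's values.
mapped = [("b", 1), ("a", 2), ("b", 3), ("a", 4)]
mapped.sort(key=lambda kv: kv[0])

reduce_inputs = [(k, [v for _, v in grp])
                 for k, grp in groupby(mapped, key=lambda kv: kv[0])]
print(reduce_inputs)  # [('a', [2, 4]), ('b', [1, 3])]
```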
A. Hadoop
B. Twister
C. Phoenix
Ans: C
A. Java
B. C
C. FORTRAN
D. Python
Ans: A
The Combine stage, if present, must perform the same aggregation operation as Reduce?
A. True
B. False
Ans:
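A combiner is not required to be identical to the reducer, but the operation it applies must be associative and commutative so that combining locally before the shuffle cannot change the final result. For a sum, the same function can serve both roles, as this simulation shows (plain Python, illustrative only):

```python
# Word-count sums are associative and commutative, so applying the same sum
# locally (as a combiner) on each map partition before the final reduce
# changes nothing but the volume of shuffled data.
def combine(partition):
    totals = {}
    for k, v in partition:
        totals[k] = totals.get(k, 0) + v
    return list(totals.items())

def reduce_all(pairs):
    totals = {}
    for k, v in pairs:
        totals[k] = totals.get(k, 0) + v
    return dict(sorted(totals.items()))

part1 = [("a", 1), ("b", 1), ("a", 1)]   # map output of partition 1
part2 = [("b", 1), ("a", 1)]             # map output of partition 2

without_combiner = reduce_all(part1 + part2)
with_combiner = reduce_all(combine(part1) + combine(part2))
print(without_combiner == with_combiner, with_combiner)
```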
Which MapReduce stage serves as a barrier, where all previous stages must be completed before it may proceed?
A. Combine
B. Group (a.k.a. 'shuffle')
C. Reduce
D. Write
Ans:
Which TACC resource has support for Hadoop MapReduce?
A. Ranger
B. Longhorn
C. Lonestar
D. Spur
Ans: A
Which of the following scenarios makes HDFS unavailable in Hadoop?
A. JobTracker failure
B. TaskTracker failure
C. DataNode failure
D. NameNode failure
E. Secondary NameNode failure
Answer:
Which of the following scenarios makes HDFS unavailable in Hadoop?
A. Map or reduce tasks that are stuck in an infinite loop.
B. HDFS is almost full.
C. The NameNode goes down.
D. A DataNode is disconnected from the cluster.
E. MapReduce jobs that are causing excessive memory swaps.
Answer:
Which of the following utilities allows you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer?
A. Oozie
B. Sqoop
C. Flume
D. Hadoop Streaming
Answer: D
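Hadoop Streaming runs any executable that reads records on stdin and writes tab-separated key/value pairs on stdout. A typical Python word-count mapper looks like this (the demo I/O below is in-memory; in a real job the script would read sys.stdin and write sys.stdout):

```python
import io

# A word-count mapper in Hadoop Streaming style: read lines of text,
# emit tab-separated <word, 1> pairs.
def mapper(lines, out):
    for line in lines:
        for word in line.split():
            out.write(f"{word}\t1\n")

# In a real Streaming job this would be: mapper(sys.stdin, sys.stdout).
# Demo with an in-memory stream instead:
buf = io.StringIO()
mapper(["to be or not to be"], buf)
print(buf.getvalue())
```

A job would then be launched with something like `hadoop jar hadoop-streaming-*.jar -input <in> -output <out> -mapper mapper.py -reducer reducer.py` (jar and script names illustrative).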
You need a distributed, scalable data store that allows you random, realtime read/write access to hundreds of terabytes of data. Which of the following would you use in Hadoop?
A. Hue
B. Pig
C. Hive
D. Oozie
E. HBase
F. Flume
G. Sqoop
Answer: E
Workflows expressed in Oozie can contain in Hadoop?
You have an employee who is a Data Analyst and is very comfortable with SQL. He would like to run ad-hoc analysis on data in your HDFS cluster. Which of the following is a data warehousing software built on top of Apache Hadoop that defines a simple SQL-like query language well-suited for this kind of user?
A. Pig
B. Hue
C. Hive
D. Sqoop
E. Oozie
F. Flume
G. Hadoop Streaming
Answer: C
In a MapReduce job, you want each of your input files processed by a single map task. How do you configure a MapReduce job so that a single map task processes each input file regardless of how many blocks the input file occupies?
A. Increase the parameter that controls minimum split size in the job configuration.
B. Write a custom MapRunner that iterates over all key-value pairs in the entire file.
C. Set the number of mappers equal to the number of input files you want to process.
D. Write a custom FileInputFormat and override the method isSplitable to always return false.
Answer: D
Which of the following best describes the workings of TextInputFormat in Hadoop?
A. Input file splits may cross line breaks. A line that crosses file splits is ignored.
B. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
C. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
D. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.
E. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
Answer: E
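In Hadoop's LineRecordReader, a reader whose split does not start at byte 0 discards the partial line at the start of its split, and every reader finishes the last line it started even if that runs past the split end, so a line straddling a boundary is read by the split containing its beginning. A toy model of that rule (simplified byte-range logic, not the real class):

```python
# Toy model of how TextInputFormat assigns lines that straddle split
# boundaries: each reader except the first skips the partial line at the
# start of its split, and every reader keeps reading past its split end to
# finish the last line it started. Not the real LineRecordReader code.
def read_split(data, start, end):
    pos = start
    if start != 0:
        # Skip the partial line; the previous reader will emit it.
        pos = data.find(b"\n", start) + 1
    lines = []
    while pos < len(data) and pos <= end:
        nl = data.find(b"\n", pos)
        if nl == -1:
            nl = len(data)
        lines.append(data[pos:nl].decode())
        pos = nl + 1
    return lines

data = b"first line\nsecond line\nthird line\n"
mid = 15  # split boundary falls inside "second line"
a = read_split(data, 0, mid - 1)
b = read_split(data, mid, len(data) - 1)
print(a)  # the broken line goes to the split containing its beginning
print(b)
```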