
Q1. Which of these storage systems can Hadoop work with?

a. HDFS
b. AWS S3
c. EMC’s Isilon
d. All of the above

Sol- (d) All of the above

HDFS is Hadoop's flagship file system, but architecturally Hadoop provides a general-purpose file
system abstraction, so a wide variety of storage systems can be plugged in. Cleversafe
and NetApp's Open Solution for Hadoop (OSH) are two popular storage systems used with
Hadoop.
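Hadoop's general-purpose abstraction is the org.apache.hadoop.fs.FileSystem class; the concrete back end is selected by the URI scheme. A minimal sketch follows (the namenode host and bucket name are hypothetical, and the s3a scheme assumes the hadoop-aws module is on the classpath):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsAbstraction {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same API, different storage systems: the URI scheme picks the implementation.
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        FileSystem s3 = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
        System.out.println(hdfs.getClass().getName() + " / " + s3.getClass().getName());
    }
}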

Q2. Which of the following are true for Hadoop Distributed File System?

1. Write Once Read Many data processing pattern is used.
2. Meant to work with commodity hardware.
3. It is not meant for low-latency data access.
4. It is not efficient when there are a large number of files to be handled.
a. 1 & 2
b. 1, 2 & 3
c. 2 & 3
d. All of the above

Sol- (d) All of the above

HDFS is designed for high-throughput data access, so low-latency applications don't find it
suitable; HBase is a good alternative for such workloads. Also, keep in mind that HDFS is meant
to handle files of huge size, not a huge number of files. This is because the namenode holds the file
system's metadata in memory, so the number of files a namenode can register depends on the
amount of memory available to it. As a rule of thumb, each file, directory, and block takes about
150 bytes, so one million files, each occupying one block, need at least 300 MB of namenode memory.

Q3. Which of the following are true for Hadoop Distributed File System?

1. Files in HDFS are broken into block-sized chunks which are stored as independent units.
2. Just like in disk-based file systems, a file in HDFS that is smaller than the block size still
occupies a full block's worth of underlying storage.
3. Having a large block size keeps the seek time small relative to the data transfer time.
4. The default block size is 128 MB.
a. All of the above.
b. Only 1
c. 1 & 4
d. 1, 3 & 4

Sol- (d) 1, 3 & 4


Unlike in disk-based file systems, a file in HDFS that is smaller than a single block does not
occupy a full block's worth of underlying storage. For example, an 8 MB file stored with a 128 MB
block size uses only 8 MB; it does not hold up the remaining 120 MB.
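To see why statement 3 holds, consider a worked example: if the seek time is around 10 ms and the transfer rate is 100 MB/s, then to keep the seek time down to roughly 1% of the transfer time the block size needs to be on the order of 100 MB, which is why the default is of that magnitude.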

Q4. What are the advantages of using block based abstraction in HDFS?

a. Blocks make distributed storage of large files possible.
b. Blocks simplify storage management, since blocks are a fixed size and file metadata need not be stored with them.
c. Blocks fit well with replication for providing fault tolerance and availability.
d. All of the above.

Sol- (d) All of the above

Q5. Which of the following are true for Hadoop Distributed File System?

a. It has two types of nodes working in a master-worker pattern: a namenode (the master) and a
number of datanodes (the workers).
b. The namenode manages the file system namespace and maintains the metadata for all the files and
directories in the file system tree.
c. The overall maintenance of the file system tree is handled by the respective datanodes.
d. The namenode persistently stores the datanodes on which all the blocks of a given file are located,
along with the block locations.

Sol- (b) The namenode manages the file system namespace and maintains the metadata for all the
files and directories in the file system tree.

The namenode does not store block locations persistently; this information is reconstructed from
block reports sent by the datanodes when the system starts. Also, everything pertaining to the
maintenance of the file system tree is handled by the namenode, not the datanodes.

Q6. Which of the following are true for Namenodes?

1. The information pertaining to the namenode is stored persistently on the local disk in the
form of two files: the namespace image and the edit log.
2. A client accesses the file system on behalf of the user by communicating with the namenode
only.
3. Without the namenode the file system cannot be used, so if the machine on which the
namenode runs malfunctions, all the files on the file system will be lost.
4. The secondary namenode can act as the namenode in case of machine failure.
a. All of the above.
b. 2, 3 & 4
c. 2 & 3
d. 1 & 3

Sol- (d) 1 & 3

Clients do interact with datanodes as well. The secondary namenode is not a standby
namenode but rather an optimiser of the edit log: it periodically merges the namespace image with
the edit log and keeps a copy of the merged namespace image, which can be used in case of
failure of the primary to arrest complete data loss.

Q7. What is block cache?

Sol- Normally a datanode reads blocks from disk, but for frequently accessed files the blocks may be
explicitly cached in the datanode's memory. By default a block is cached in only one datanode's
memory, though the number is configurable on a per-file basis.
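A sketch of how this is typically administered, assuming the standard centralized cache management commands (the pool and path names here are hypothetical):

hdfs cacheadmin -addPool sales-pool
hdfs cacheadmin -addDirective -path /user/sales/hot.dat -pool sales-pool -replication 2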

Q8. What is HDFS Federation?

Sol- It is the creation of a multi-namenode architecture to ensure horizontal scalability, which is key
to managing a large number of files. Each namenode manages a portion of the file system
namespace, and the failure of one namenode doesn't affect the others.

Q9. What is the need behind the concept of HDFS High Availability Architecture?

Sol- In the event of a namenode failure, another namenode has to be made available to restore
the service. For the new namenode to come online, it must load the namespace image into
memory, replay the edit log, and receive enough block reports from the datanodes to leave safe
mode. This can take as long as 30 minutes for a very large file system. The HDFS High Availability
architecture was introduced to reduce this restoration time.

Q10. What architectural changes have been undertaken to ensure High Availability?

Sol- The main changes are:

a. The namenodes must use highly available shared storage to share the edit log.
b. The datanodes must send block reports to both namenodes, because the block mappings are
stored in a namenode's memory, not on disk.
c. Clients must be configured to handle namenode failover, using a mechanism that is
transparent to users.
d. The secondary namenode's role is subsumed by the standby, which takes periodic
checkpoints of the active namenode's namespace.

Q11. What is QJM?

Sol- QJM stands for Quorum Journal Manager. It is a dedicated HDFS implementation, designed for
the sole purpose of providing a highly available edit log, and is the recommended choice for most
HDFS installations. QJM runs as a group of journal nodes, and each edit must be written to a
majority of them. Typically there are three journal nodes, so the system can tolerate the loss of one
of them.
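In a typical deployment (a sketch; the hostnames and nameservice ID are hypothetical), the shared edit log is configured by pointing the dfs.namenode.shared.edits.dir property at a qjournal:// URI that lists the journal nodes:

qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster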

Q12. What is a failover controller?

Sol- In the QJM arrangement of an HDFS High Availability implementation, the failover controller is
the entity that manages the transition from the active namenode to the standby. Each namenode runs
a lightweight failover controller whose main job is to monitor its host namenode for failures, using a
simple heartbeat mechanism, and to trigger a failover should it fail. Various failover controllers
exist, but the default implementation uses ZooKeeper to ensure that only one namenode is active.

Q13. What is graceful failover?

Sol- Failovers initiated by a system administrator, typically during routine maintenance cycles, are
known as graceful failovers. The name reflects the fact that the failover controller can arrange an
orderly transition in which both namenodes switch roles.

Q14. Why is fencing needed in HDFS High Availability implementations?

Sol- The failover controller cannot be certain that the failed namenode has actually stopped
running. It might therefore happen that, after the standby namenode goes active, the previously
active namenode continues to behave as the active namenode and corrupts data. The fencing
mechanism of the High Availability architecture prevents this.

Q15. Why is QJM more popular than NFS filer?

Sol- QJM has an inbuilt fencing property that ensures only one namenode can write to the edit log at
a time, whereas this is not so in the case of an NFS filer. Hence stronger external fencing methods
are required when an NFS filer is used.

Q16. What are common fencing strategies?

Sol- QJM itself allows only one namenode to write to the edit log at a time. An SSH fencing command
can prevent a previously active namenode from serving stale read requests to clients by killing
the namenode's process.

In the case of an NFS filer, the namenode's access to the shared storage directory can be revoked via
a vendor-specific NFS command. Lastly, the STONITH (Shoot The Other Node In The Head) technique
can be employed, where a specialised power distribution unit forcibly powers down the machine
hosting the failed namenode once the standby namenode is made active.

Q17. In HDFS, client failover is handled transparently by the _____________.

Sol- client library

Q18. HDFS has a permission model for files and directories that is much like the ___________.

Sol- POSIX Model

Q19. Which of the following is not true for the permission model of HDFS?

a. There are three types of permissions; (r) read, (w) write and (x) execute.
b. The read permission is required to read files or list the contents of a directory.
c. The write permission is required to write a file, or for a directory, to create or delete files or
directories in it.
d. The execute permission is required to run a file, and for a directory this permission is used to
access its children.
Sol- (d) The execute permission is required to run a file, and for a directory this permission is
used to access its children.

Unlike in the POSIX model, no file can be executed in HDFS, so the execute permission is ignored
for a file. For a directory, however, the execute permission is indeed required to access its children.
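Setting permissions through the Java API reflects this model directly. A minimal sketch (the file path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class SetPerms {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // rw-r--r--: no execute bit, since HDFS ignores it for files anyway.
        fs.setPermission(new Path("/user/alice/notes.txt"),
                new FsPermission(FsAction.READ_WRITE, FsAction.READ, FsAction.READ));
    }
}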

Q20. In HDFS, the client opens the file it wishes to read by calling _______ on the
FileSystem object.

Sol- open()

Q21. In HDFS the FileSystem object is an instance of ______________.

Sol- DistributedFileSystem

Q22. The DistributedFileSystem calls the namenode, using _____________, to
determine the location of the first few blocks in the file.

Sol- Remote Procedure Call (RPC)

Q23. In the event of the RPC generated by the DistributedFileSystem, the namenode
returns, for each block, the addresses of the datanodes that have a copy of that block. These
datanodes are listed according to their _________________.

Sol- proximity to the client

Q24. The DistributedFileSystem returns an ____________________, which is an input
stream that supports file seeks, to the client for it to read data from.

Sol- FSDataInputStream

Q25. The FSDataInputStream in turn wraps a ______________________, which manages
the datanode and namenode I/O.

Sol- DFSInputStream

Q26. On receiving the read() call from the client, DFSInputStream, which has stored the
datanode addresses for the first few blocks in the file, then ___________________________.

Sol- connects to the closest datanode for the first block in the file.

Q27. During the process of reading a block, the ________ call is invoked repeatedly until the end of
the block is reached. At this point the DFSInputStream will close the connection to the
datanode and then find the best datanode for the next block.

Sol- read()

Q28. When the client has finished reading it calls _____ on the FSDataInputStream.

Sol- close()
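Taken together, Q20-Q28 describe the client-side read path, which in code is quite short. A minimal sketch (the namenode address and file path are hypothetical):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The FileSystem object is a DistributedFileSystem instance (Q21).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
        // open() triggers the RPC to the namenode for block locations (Q20, Q22).
        InputStream in = fs.open(new Path("/user/alice/sample.txt"));
        try {
            // read() is served by the wrapped DFSInputStream, which streams each
            // block from the closest datanode holding a copy (Q25-Q27).
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            in.close(); // Q28
        }
    }
}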
Q29. During reading, if the DFSInputStream encounters an error while communicating with
a datanode,

a. It will try the next closest one.
b. It will remember the datanode which failed.
c. Both (a) and (b)
d. None of the above.

Sol- (c)

Remembering the datanode that failed helps the DFSInputStream avoid needlessly retrying it
for later blocks.

Q30. To detect data corruption during data transfer, the DFSInputStream utilises-

a. heartbeat monitoring,

b. checksum verification,

c. edit log parity check,

d. Any of these depending on the HDFS configuration by the administrator.

Sol- (b) Checksum verification

Q31. If a corrupted block is found, the DFSInputStream-

a. attempts to read a replica of the block from another datanode.

b. reports the corrupted block to the namenode.

c. waits for the explicit instruction from the client.

d. (a) and (b)

Sol- (d)

Q32. HDFS is inspired by which of the following Google projects?

a. BigTable

b. GFS

c. MapReduce

d. GCP

Sol- (b) GFS (Google File System)

Q33. In HDFS, a datanode sends frequent __________ to the namenode.

Sol- Heartbeats

Q34. Clients connect to ________ for I/O

Sol- datanode

Q35. For reading/writing data to/from HDFS, clients first connect to ______________.

Sol- namenode

Q36. The namenode loses its only copy of the fsimage file. We can recover this from which of the
following?

a. Secondary Namenode

b. No recovery possible.

c. Datanodes

d. Checkpoint Node

Sol- (b) No recovery possible

Q37. When a backup node is used in a cluster there is no need of which of the following?

a. Standby Namenode
b. Checkpoint Node
c. Secondary Namenode
d. Secondary Datanode

Sol- (b) Checkpoint Node

Q38. The HDFS command to copy a file from the local file system is which of the
following?

a. copyFromLocal
b. copyfromlocal
c. copyLocal
d. CopyLocal

Sol- (a) copyFromLocal
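For example (file names and paths hypothetical), to copy the local file notes.txt into a user's HDFS home directory:

hadoop fs -copyFromLocal notes.txt /user/alice/notes.txt

The same command is also available as hdfs dfs -copyFromLocal in newer releases.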

Q39. HDFS provides a command-line interface called __________ that is used to interact with HDFS.

Sol- FS Shell

Q40. Which of the following is the daemon of HDFS?

a. Secondary namenode
b. Node Manager
c. Resource manager
d. All of these

Sol- (a) Secondary namenode


Q41. Which of the following stores metadata?

a. Datanode
b. Namenode
c. Secondary Namenode
d. Both (b) and (c)

Sol- (b) Namenode

Q42. Which of the following statements is true about the Secondary Namenode?

a. It stores the modified fsimage into persistent storage.
b. It stores the merged fsimage with the edit logs back to the active namenode.
c. It doesn't store the modified fsimage into persistent storage.
d. Both (a) and (b)

Sol- (a) It stores the modified fsimage into persistent storage.

Q43. Which of the following is the core component of HDFS?

a. Node Manager
b. Datanode
c. Resource Manager
d. HDFS Daemon

Sol- (b) Datanode

Q44. Which statement is true about Namenode?

a. It is the slave node that stores actual data.
b. It is the master node that stores metadata.
c. It acts as the Network Manager for the cluster.
d. It is not responsible for maintenance of the cluster.

Sol- (b) It is the master node that stores metadata.

Q45. Which of the following is true about metadata?

a. Metadata shows the structure of HDFS directories/files.
b. It contains information like the number of blocks, their locations, and replicas.
c. fsimage & edit log are metadata files.
d. All of the above.

Sol- (d) All of the above

Q46. Which statement is true about DataNode?

a. It is the slave node that stores actual data.
b. The client doesn't communicate with it at all.
c. It is a high-performance specialised machine.
d. It contains the replica of the fsimage & edit log files.

Sol- (a) It is the slave node that stores actual data.

Datanodes run on commodity hardware, and the client does communicate with them.

Q47. Is Secondary Namenode a Backup node?

Sol- No, it’s not a backup node.

Q48. Which of the following is NOT a type of metadata in Namenode?

a. List of Files
b. Block Location of Files
c. Number of File Records
d. File Access Control Information

Sol- (c) Number of File Records

Q49. Which of the following is the Single Point of Failure (SPOF)?

a. Namenode
b. Secondary Namenode
c. Checkpoint Node
d. Datanode
e. All of these

Sol- (a) Namenode

Q50. What is Rack Awareness in HDFS?

Sol- In a large Hadoop cluster, in order to reduce network traffic while reading/writing
HDFS files, the namenode chooses datanodes that are on the same rack as, or on a rack near to, the
client issuing the read/write request. The namenode obtains rack information by maintaining the
rack IDs of each datanode. This concept of choosing closer datanodes based on rack information is
called Rack Awareness in Hadoop.
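In practice, the node-to-rack mapping is usually supplied by the administrator, typically via a script configured through the net.topology.script.file.name property that maps an IP address or hostname to a rack ID such as /rack1 (the rack ID shown is illustrative).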
