Hadoop Interview 1

A Hadoop cluster is a collection of nodes.
Q1) What is Big Data?
Answer: Big Data is relative term. When Data can’t be handle using conventional systems like
RDBMS because Data is generating with very high speed, it is known as Big Data.
Q2) Why Big Data?
Answer: Since Data is growing rapidly and RDBMS can’t control it, Big Data technologies came
into picture.
Q3) What are 3 core dimension of Big Data.
Answer: Big Data have 3 core dimensions:
Volume
Variety
Velocity
Q4) Role of Volume in Big Data
Answer: Volume: Volume is nothing but amount of data. As Data is growing with high speed, a
huge volume of data is getting generated every second.
Q5) Role of variety in Big Data
Answer: Variety: So many applications are running nowadays like mobile, mobile sensors etc.
Each application is generating data in different variety.
Q6) Role of Velocity in Big Data
Answer: Velocity: This is speed of data getting generated. for example: Every minute, Instagram
receives 46,740 new photos. So day by day speed of data generation is getting higher.
Q7) Remaining 2 less known dimension of Big Data
Answer: There are two more V’s of Big Data. Below are less known V’s:
Veracity
Value
Q8) Role of Veracity in Big Data
Answer: Veracity: Veracity is nothing but the accuracy of data. Big Data should have some
accurate data in order to process it.
Q9) Role of Value in Big Data
Answer: Value: Big Data should contain some value to us. Junk Values/Data is not considered as
real Big Data.
Q10) What is Hadoop?
Answer: Hadoop: Hadoop is a project of Apache. This is a framework which is open Source.
Hadoop is use for storing Big data and then processing it.
Q11) Why Hadoop?
Answer: In order to process Big data, we need some framework. Hadoop is an open source
framework which is owned by Apache organization. Hadoop is the basic requirement when we
think about processing big data.
Q12) Connection between Hadoop and Big Data
Answer: Big Data will be processed using some framework. This framework is known as Hadoop.
Q13) Hadoop and Hadoop Ecosystem

Answer: Hadoop Ecosystem is nothing but a combination of various components. Below are the
components which comes under Hadoop Ecosystem’s Umbrella:
HDFS
YARN
MapReduce
Pig
Hive
Sqoop, etc.
Q14) What is HDFS.
Answer: HDFS: HDFS is known as Hadoop Distributed File System. Like Every System have one file
system in order to see/manage files stored, in the same way Hadoop is having HDFS which works
in distributed manner.
Q15) Why HDFS?
Answer: HDFS is the core component of Hadoop Ecosystem. Since Hadoop is a distributed
framework and HDFS is also distributed file system. It is very well compatible with Hadoop.
Q16) What is YARN
Answer: YARN: YARN is known as Yet Another Resource Manager. This is a project of Apache
Hadoop.
Q17) Use of YARN.
Answer: YARN is use for managing resources. Jobs are scheduled using YARN in Apache Hadoop.
Q18) What is MapReduce?
Answer: MapReduce: MapReduce is a programming approach which consist of two steps: Map
and Reduce. MapReduce is the core of Apache Hadoop.
Q19) Use of MapReduce
Answer: MapReduce is a programming approach to process our data. MapReduce is use to

process Big Data.
Q20) What is Pig?
Answer: This is a project of Apache. It is a platform using which huge datasets are analyzed. It
runs on the top of MapReduce.
Q21) Use of Pig
Answer: Pig is use for the purpose of analyzing huge datasets. Data flow are created using Pig in
order to analyze data. Pig Latin language is use for this purpose.
Q22) What is Pig Latin
Answer: Pig Latin is a script language which is used in Apache Pig to create Data flow in order to
analyze data.
Q23) What is Hive?
Answer: Hive is a project of Apache Hadoop. Hive is a dataware software which runs on the top
of Hadoop.
Q24) Use of Hive
Answer: Hive works as a storage layer which is used to store structured data. This is very useful
and convenient tool for SQL user as Hive use HQL.
Q25) What is HQL?
Answer: HQL is an abbreviation of Hive Query Language. This is designed for those user who are
very comfortable with SQL. HQL is use to query structured data into hive.
Q26) What is Sqoop?
Answer: Sqoop is a short form of SQL to Hadoop. This is basically a command line tool to transfer
data between Hadoop and SQL and vice-versa.
Q27) Use of Sqoop?
Answer: Sqoop is a CLI tool which is used to migrate data between RDBMS to Hadoop and vice-
versa.
Q28) What are other components of Hadoop Ecosystem?
Answer: Below are other components of Hadoop Ecosystem:
a) HBase
b) Oozie
c) Zookeeper
d) Flume etc.
Q29) Difference Between Hadoop and HDFS
Answer: Hadoop is a framework while HDFS is a file system which works on the top of Hadoop.
Q30) How to access HDFS
Answer: below is command:
hdfs fs or hdfs dfs
Q31) How to create directory in HDFS

hdfs fs -mkdir <dir_name>
Q32) How to keep files in HDFS
hdfs fs -put <source_file_path> <destination_file_path>
or
hdfs fs -copyfromLocal <source_file_path> <destination_file_path>
Q33) How to copy file from HDFS to local
hdfs fs -copyToLocal <source_file_path> <destination_file_path>
Q34) How to Delete directory from HDFS
hdfs fs -rm <dir_name>
Q35) How to Delete file from HDFS

hdfs fs -rm <file_name>
Become an Big Data Hadoop Certified Expert in 25Hours
Q36) How to Delete directory and files recursively from HDFS
hdfs fs -rm -r <file_path>
Q37) How to read file in HDFS
hdfs fs -cat <file_path?
Q38) What are the other file system available in market?
Answer: FAT, NAS, EXT are the well-known file systems available in market.
Q39) What are basic steps to be performed while working with big data?
Answer: below are the basic steps to be done while working with Big Data:
Data Ingestion
Data Storage
Data Processing
Q40) What is data ingestion?
Answer: Before Big Data came into Picture, our data used to reside into RDBMS. Data Ingestion is
a process to move/ingest your data from one place to another place. In the reference of Big
Data, Data movement from RDBMS to Hadoop is known as Data Ingestion.
Q41) Explain data storage
Answer: This steps comes into picture after Data Ingestion. Ingested data is stored into different
storage layers like: HDFS, Hive tables etc.
Q42) What is data processing in big data?
Answer: Data Processing: Once you have data in HDFS, Data is processed for different purpose.
Data can be processed using MapReduce, Hive tables etc.
Q43) What is unstructured data?
Answer: There are a huge number of source available which generate different types of data.
Some Sources generate data which can’t be stored into tables i.e. that data is not in tabular
form. Such data is known as Unstructured Data.
Q44) What are the storage layer available to store unstructured data?
Answer: HBase, Cassandra, MongoDB are well known storage layers available in market to store
unstructured data.
Q45) What are the most important qualities of Hadoop?
Answer: below are most of the well-known and useful features of Hadoop:
Open Source
Distributed Processing
Fault tolerant
High available
Commodity Hardware
Q46) What do you mean by Open Source?

Answer: Hadoop is a framework given by Apache. Frameworks which are available for free of
cost are known as Open Source.
Q47) What do you mean by Distributed processing?
Answer: Data stored in Hadoop is distributed across clusters in order to give better performance
and to make data highly available.
Q48) What is Fault tolerance in Hadoop?
Answer: Since Data is Highly available in Hadoop. There is very minimum or no chance to lose
data as each data is replicated 3 times by default. So Hadoop is known as Highly Fault tolerant
framework.
Q49) What is High availability in Hadoop?
Answer: Hadoop stores all data 3 times i.e. it makes 3 copy of each data. This number can be
change. By doing this, Hadoop makes data highly available as there is no chance to lose data. If
data will not be available on any node, Hadoop will bring data from other node and will provide
to client.
Q50) What is replication factor in Hadoop?
Answer: Replication factor is the term which is used to decide the number by which each data
will be replicated into Hadoop. User can change replication factor according to their need. By
default, its value is 3.
Q51) What are th elimitation with Traditional RDBMS?
Answer:
prcessing time is high
storage limitation
stores only the structured data which is in the form of rows and columns
cost of hardware is high
cost of sofware is high
not a open source. we cannot do any customization
Q52) What are the 4 V’s of big data hadoop ? or features of Bigdata hadoop?
Answer:
Volume
velocity
veracity
variety
Q53) What is Name Node?
Answer:
Name node is master in hadoop architecture which stores the meta data of data nodes.
in hadoop1.x only one name node is used where as in hadoop 2.x 2 name nodes are used.
Name node is the sinlge point of failure in hadoop1.x
Q54) What are datanodes?
Answer: Data nodes are the slaves which are used to store data/ block of file
Q55) What is the default replication factor?
Answer: 3 set in hdfs-site.xml

can be set according to data high availability
Q56) What are JT, TT, Secondary name node in hadoop architecture?
Answer:
JT – Job Tracker which assigns jobs to Task trackers.
TT- Task tracker which executes the job assigned by JT
Secondary name node- its a name node keeps the metadata information of name node.
After evry 30 min, the name node info is updated in secondary name node.
Q57) What is rack awareness?
Answer:kepping atleast one copy of block in another emote tracker
Q58) Write the command to copy a file from linux to hdfs
Answer: make sure you have a file called simple.txt in a path /home/training/simple.txt
$hadoop fs – ls copyFromLocal /home/training/simple.txt /GangBoard_HDFS
where GangBoard_HDFS is a directory created in HDFS
instead of copyFromLocal we can use put.

put and copyFromLocal give the same result.
Q59) Write command to copy a file from HDFS to linux(local).
Answer: $hadoop fs – ls copyToLocal /GangBoard_HDFS/simple.txt /home/training
Get and copyToLocal give the same result
Q60) What is Hive?
Answer: Hive is tool used for querrying and processing a data. Hive is developed by Facebook
and donated to ApacheSoftwareFoundation.
Hives is to store mostly a structured data.
Q61) What happened after creating a table in hive?
Answer:
all metadata will be stored in meta store database
a default directory will be created in /hive/usr/warehouse with the table name.
Q62) What are the 2 types of table in hive?
Answer:
–>Managed/internal table
Here once the table gets deleted both meta data and actual data is deleted
–>external table
Here once the table gets deleted only the mata data gets deleted but not the actual data.
Q63) How to managed create a table in hive?
Answer: hive>create table student(sname string, sid int) row format delimited fileds terminated
by ‘,’;
//hands on
hive>describe student;
Q64) How to load data into table created in hive?
Answer: hive>load data local inpath /home/training/simple.txt into table student;
//hands on
hive> select * from student;
Q65) How to create/load data into exteranal tables?
Answer: *without location
hive>create external table student(sname string, sid int) row format delimited fileds terminated
by ‘,’;
hive>load data local inpath /home/training/simple.txt into table student;

*With Location
hive>create external table student(sname string, sid int) row format delimited fileds terminated
by ‘,’ location /GangBoard_HDFS;
Here no need of load command
Became an Big Data Hadoop Expert with Certification in 25hours
Q66) Write a command to write static partitioned table.
Answer: hive>create table student(sname string, sid int) partitioned by(int year) row format
delimited fileds terminated by ‘,’;
Q67) How to load a file in static partition?
Answer: hive>load data local inpath /home/training/simple2018.txt into table student

partition(year=2018);
Q68) Write a commands to write dynamic partitioned table.
Answer:
–> create a normal table
hive>create table student(sname string, sid int) row format delimited fileds terminated by ‘,’;
–>load data
hive>load data local inpath /home/training/studnetall.txt into table student ;

–>create a partitioned table
hive>create table student_partition(sname string, sid int) partitioned by(int year) row format
delimited fileds terminated by ‘,’;
–>set partitions
hive>set hive.exec.dynamic.partition.mode = nonstrict;
–>insert data
hive>insert into table student_partition select * from student;
–>drop normal table
hive>drop table student;
Q69) What is pig?
Answer:Pig is an abstraction over map reduce. It is a tool used to deal with huge amount of
structured and semi structed data.
Q70) What is atom in pig?
Answer: its a small piece of data or a filed
eg: ‘shilpa’
Q71) What is tuple?
ordered set of filed
(shilpa, 100)
Q72) Bag in pig?
Answer: un-ordered set of tuples
eg.{(sh,1),(ww,ww)}
Q73) What is relation?
Answer: bag of tuples
Q74) What is hbase?
Answer: its a distributed column oriented database built on top of hadoop file system
it is horizontally scalable
Q75) Difference between hbase and rdbms
Answer: RDMBS is schema based
hbase is not
RDMBS only structured data

hbase structured and semi structured data.
RDMBS involves transactions
Hbase no transactions
Q76) What is table in hbase?
Answer: collection of rows
Q77) What is row in hbase?
Answer: collection of column families
Q78) Column family in hbase?
Answer:collection of columns
Q79) What is column?
Answer:collection of key value pair
Q80) How to start hbase services?
Answer:
>hbase shell
hbase>start -hbase.sh
Q81) DDL commands used in hbase?
Answer:
create
alter
drop
drop_all
exists
list
enable
is_enabled?
disable
is_disbled?
Q82) DML commands?
Answer:
put
get
scan
delete
delete_all
Q83) What services run after running hbase job?
Answer:
Name node
data node
secondary NN
JT
TT
Hmaster
HRegionServer
HQuorumPeer
Q84) How to create table in hbase?
Answer:>create ’emp’, ‘cf1′,’cf2’
Q85) How to list elements
Answer:>scan ’emp’
Q86) Scope operators used in hbase?
Answer:
MAX_FILESIZE
READONLY
MEMSTORE_FLUSHSIZE
DEFERRED_LOG_FLUSH
Q87) What is sqoop?
Answer: sqoop is an interface/tool between RDBMS and HDFS to importa nd export data
Q88) How many default mappers in sqoop?
Answer: 4
Q89) What is map reduce?
Answer: map reduce is a data processing technique for distributed computng base on java
map stage
reduce stage
Q90) list few componets that are using big data
Answer:
facebook
adobe
yahoo
twitter
ebay
Q91) Write a quert to import a file in sqoop
Answer: $>sqoop-import –connect jdbc:mysql://localhost/GangBoard
username hadoop
password hadoop
table emp
target_dir sqp_dir
fields_terminated_by ‘,’
m1
Q92) What is context in map reduce?
Answer: it is an object having the information about hadoop configuration
Q93) How job is started in map reduce?
Answer: To start a job we need to create a configuration object.

configuration c = new configuration();
Job j = new Job(c,”wordcount calculation);
Q94) How to load data in pig?
Answer: A= load ‘/home/training/simple.txt’ using PigStorage ‘|’ as (sname : chararray, sid: int,
address:chararray);
Q95) What are the 2 modes used to run pig scripts?
Answer: local mode
pig -x local
pig -x mapreduce
Q96) How to show up details in pig ?
Answer: dump command is used.
grunt>dump A;
Q97) How to fetch perticular columns in pig?
Answer: B = foreach A generate sname, sid;
Q100) How to restrict the number of lines to be printed in pig ?
Answer: c=limit B 2;
Get Big Data Hadoop Online Training
Q101) Define Big Data
Answer: Big Data is defined as a collection of large and complex of unstructured data sets from
where insights are derived from the Data Analysis using open-source tools like Hadoop.
Q102) Explain The Five Vs of Big Data
Answer: The five Vs of Big Data are –
Volume – Amount of data in the Petabytes and Exabytes
Variety – Includes formats like an videos, audio sources, textual data, etc.
Velocity – Everyday data growth which are includes conversations in forums,blogs,social media
posts,etc.
Veracity – Degree of accuracy of data are available
Value – Deriving insights from collected data to the achieve business milestones and new heights
Q103) How is Hadoop related to the Big Data ? Describe its components?
Answer: Apache Hadoop is an open-source framework used for the storing, processing, and
analyzing complex unstructured data sets for the deriving insights and actionable intelligence for
businesses.
The three main components of Hadoop are-

MapReduce – A programming model which processes large datasets in the parallel
HDFS – A Java-based distributed file system used for the data storage without prior organization
YARN – A framework that manages resources and handles requests from the distributed
applications
Q104) Define HDFS and talk about their respective components?
Answer: The Hadoop Distributed File System (HDFS) is the storage unit that’s responsible for the
storing different types of the data blocks in the distributed environment.
The two main components of HDFS are-
NameNode – A master node that processes of metadata information for the data blocks
contained in the HDFS
DataNode – Nodes which act as slave nodes and a simply store the data, for use and then
processing by the NameNode.
Q105) Define YARN, and talk about their respective components?
Answer: The Yet Another Resource Negotiator (YARN) is the processing component of the
Apache Hadoop and is responsible for managing resources and providing an execution
environment for said of processes.
The two main components of YARN are-
ResourceManager– Receives processing requests and allocates its parts to the respective Node
Managers based on processing needs.
Node Manager– Executes tasks on the every single Data Node
Q106) Explain the term ‘Commodity Hardware?
Answer: Commodity Hardware refers to hardware and components, collectively needed, to run
the Apache Hadoop framework and related to the data management tools. Apache Hadoop
requires 64-512 GB of the RAM to execute tasks, and any hardware that supports its minimum
for the requirements is known as ‘Commodity Hardware.
Q107) Define the Port Numbers for NameNode, Task Tracker and Job Tracker?
Answer: Name Node – Port 50070
Task Tracker – Port 50060
Job Tracker – Port 50030
Q108) How does HDFS Index Data blocks? Explain.
Answer: HDFS indexes data blocks based on the their respective sizes. The end of data block
points to address of where the next chunk of data blocks get a stored. The DataNodes store the
blocks of datawhile the NameNode manages these data blocks by using an in-memory image of
all the files of said of data blocks. Clients receive for the information related to data blocked from
the NameNode.
Q109) What are Edge Nodes in Hadoop?
Answer: Edge nodes are gateway nodes in the Hadoop which act as the interface between the
Hadoop cluster and external network.They run client applications and cluster administration
tools in the Hadoop and are used as staging areas for the data transfers to the Hadoop cluster.
Enterprise-class storage capabilities (like 900GB SAS Drives with Raid HDD Controllers) is
required for the Edge Nodes,and asingle edge node for usually suffices for multiple of Hadoop
clusters.
Q110) What are some of the data management tools used with the Edge Nodes in Hadoop?
Answer: Oozie,Ambari,Hue,Pig and Flume are the most common of data management tools that
work with edge nodes in the Hadoop. Other similar tools include to HCatalog,BigTop and Avro.
Q111) Explain the core methods of a Reducer?
Answer: There are three core methods of a reducer. They are-
setup() – Configures different to parameters like distributed cache, heap size, and input data.
reduce() – A parameter that is called once per key with the concerned on reduce task
cleanup() – Clears all temporary for files and called only at the end of on reducer task.
Q112) Talk about the different tombstone markers used for deletion purposes in HBase.?
Answer: There are three main tombstone markers used for the deletion in HBase. They are-
Family Delete Marker – Marks all the columns of an column family
Version Delete Marker – Marks a single version of an single column
Column Delete Marker– Marks all the versions of an single column
Q113) How would you transform unstructured data into structured data?
Answer: How to Approach: Unstructured data is the very common in big data. The unstructured
data should be
transformed into the structured data to ensure proper data are analysis.
Q114) Which hardware configuration is most beneficial for Hadoop jobs?
Answer: Dual processors or core machines with an configuration of 4 / 8 GB RAM and ECC
memory is ideal for
running Hadoop operations. However, the hardware is configuration varies based on the project-
specific
workflow and process of the flow and need to the customization an accordingly.
Q115) What is the use of the Record Reader in Hadoop?
Answer: Since Hadoop splits data into the various blocks, RecordReader is used to read the slit
data into the single record. For instance, if our input data is the split like:
Row1: Welcome to
Row2: GangBoard
It will be read as the “Welcome to GangBoard” using RecordReader.
Q116) What is Sequencefilein put format?
Answer: Hadoop uses the specific file format which is known as the Sequence file. The sequence
file stores data in the serialized key-value pair. Sequencefileinputformat is an input format to the
read sequence files.
Q117) What happens when two users try to access to the same file in HDFS?
Answer: HDFS NameNode supports exclusive on write only. Hence, only the first user will receive
to the grant for the file access & second that user will be rejected.
Q118) How to recover an NameNode when it’s are down?
Answer: The following steps need to execute to the make the Hadoop cluster up and running:
Use the FsImage which is file system for metadata replicate to start an new NameNode.
Configure for the DataNodes and also the clients to make them acknowledge to the newly
started NameNode.
Once the new NameNode completes loading to the last for checkpoint FsImage which is the
received to enough block reports are the DataNodes, it will start to serve the client.
In case of large of Hadoop clusters, the NameNode recovery process to consumes a lot of time
which turns out to be an more significant challenge in case of the routine maintenance.
Q119) What do you understand by the Rack Awareness in Hadoop?
Answer: It is an algorithm applied to the NameNode to decide then how blocks and its replicas
are placed.
Depending on the rack definitions network traffic is minimized between DataNodes within the
same of rack.
For example, if we consider to replication factor as 3, two copies will be placed on the one rack
whereas the third copy in a separate rack.
Q120) What are the difference between of the “HDFS Block” and “Input Split”?
Answer: The HDFS divides the input data physically into the blocks for processing which is known
as the HDFS Block.
Input Split is a logical division of data by the mapper for mapping operation.[/toggle_content]
Q121) DFS can handle a large volume of data then why do we need Hadoop framework?
Answer: Hadoop is not only for the storing large data but also to process those big data. Though
DFS (Distributed File System) tool can be store the data, but it lacks below features-
It is not fault for tolerant Data movement over the network depends on bandwidth.
Q122) What are the common input formats are Hadoop?
Answer: Text Input Format – The default input format defined in the Hadoop is the Text Input
Format.
Sequence File Input Format – To read files in the sequence, Sequence File Input Format is used.
Key Value Input Format – The input format used for the plain text files (files broken into lines) is
the Key Value for Input Format.
Q123) Explain some important features of Hadoop?
Answer: Hadoop supports are the storage and processing of big data. It is the best solution for
the handling big data challenges. Some of important features of Hadoop are
1. Open Source – Hadoop is an open source framework which means it is available free of cost
Also,the users are allowed to the change the source code as per their requirements.
2. Distributed Processing – Hadoop supports distributed processing of the data i.e. faster
processing.
The data in Hadoop HDFS is stored in the distributed manner and MapReduce is responsible for
the parallel processing of data.
3. Fault Tolerance – Hadoop is the highly fault-tolerant. It creates three replicas for each block at
different nodes, by the default. This number can be changed in according to the requirement. So,
we can recover the data from the another node if one node fails. The detection of node of failure
and recovery of data is done automatically.
4. Reliability – Hadoop stores data on the cluster in an reliable manner that is independent of
the machine. So, the data stored in Hadoop environment is not affected by the failure of
machine.
5. Scalability – Another important feature of Hadoop is the scalability. It is compatible with the
other hardware and we can easily as the new hardware to the nodes.
6. High Availability – The data stored in Hadoop is available to the access even after the
hardware failure. In case of hardware failure, the data can be accessed from the another path.
Q124) Explain the different modes are which Hadoop run?
Answer: Apache Hadoop runs are the following three modes –
Standalone (Local) Mode – By default, Hadoop runs in the local mode i.e. on a non-
distributed,single node. This mode use for the local file system to the perform input and output
operation. This mode does not support the use of the HDFS, so it is used for debugging. No
custom to configuration is needed for the configuration files in this mode.
In the pseudo-distributed mode, Hadoop runs on a single of node just like the Standalone mode.
In this mode, each daemon runs in the separate Java process. As all the daemons run on the
single node, there is the same node for the both Master and Slave nodes.
Fully – Distributed Mode – In the fully-distributed mode, all the daemons run on the separate
individual nodes and thus the forms a multi-node cluster. There are different nodes for the
Master and Slave nodes.
Q125) What is the use of jps command in Hadoop?
Answer: The jps command is used to the check if the Hadoop daemons are running properly or
not. This command shows all the daemons running on the machine i.e. Datanode, Namenode,
NodeManager, ResourceManager etc.
Q126) What are the configuration parameters in the “MapReduce” program?
Answer: The main configuration parameters in “MapReduce” framework are:
Input locations of Jobs in the distributed for file system
Output location of Jobs in the distributed for file system
The input format of data
The output format of data
The class which contains for the map function
The class which contains for the reduce function

JAR file which contains for the mapper, reducer and the driver classes
Q127) What is a block in HDFS? what is the default size in Hadoop 1 and Hadoop 2? Can we
change the block size?
Answer: Blocks are smallest continuous of data storage in a hard drive. For HDFS, blocks are
stored across Hadoop cluster.
The default block size in the Hadoop 1 is: 64 MB
The default block size in the Hadoop 2 is: 128 MB
Yes,we can change block size by using the parameters – dfs.block.size located in the hdfs-site.xml
file.
Q128) What is Distributed Cache in the MapReduce Framework?
Answer: Distributed Cache is an feature of the Hadoop MapReduce framework to cache files for
the applications.
Hadoop framework makes cached files for available for every map/reduce tasks running on the
data nodes.
Hence, the data files can be access the cache file as the local file in the designated job.
Q129) What are the three running modes of the Hadoop?
Answer: The three running modes of the Hadoop are as follows:

Standalone or local: This is the default mode and doesn’t need any configuration. In this mode,
all the following components for Hadoop uses local file system and runs on single JVM –
NameNode
DataNode
ResourceManager
NodeManager
Pseudo-distributed: In this mode, all the master and slave Hadoop services is deployed and
executed on a single node.
Fully distributed: In this mode, Hadoop master and slave services is deployed and executed on
the separate nodes.
Q130) Explain JobTracker in Hadoop?
Answer: JobTracker is a JVM process in the Hadoop to submit and track MapReduce jobs.
JobTracker performs for the following activities in Hadoop in a sequence –
JobTracker receives jobs that an client application submits to the job tracker
JobTracker notifies NameNode to determine data node
JobTracker allocates TaskTracker nodes based on the available slots.

It submits the work on the allocated TaskTracker Nodes,
JobTracker monitors on the TaskTracker nodes.
Q131) What are the difference configuration files in Hadoop?
Answer: The different configuration files in Hadoop are –
core-site.xml – This configuration file of contains Hadoop core configuration settings,
for example, I/O settings, very common for the MapReduce and HDFS. It uses hostname an port.
mapred-site.xml – This configuration file specifies a framework name for MapReduce by the
setting
mapreduce.framework.name
hdfs-site.xml – This configuration file contains of HDFS daemons configuration for settings. It
also
specifies default block for permission and replication checking on HDFS.
yarn-site.xml – This configuration of file specifies configuration settings for the

ResourceManager and
NodeManager.
Q132) What are the difference between Hadoop 2 and Hadoop 3?
Answer: Following are the difference between Hadoop 2 and Hadoop 3 –
Kerberos are used to the achieve security in Hadoop. There are 3 steps to access an service while
using Kerberos, at a high level. Each step for involves a message exchange with an server.
Authentication – The first step involves authentication of the client to authentication server, and
then provides an time-stamped TGT (Ticket-Granting Ticket) to the client.
Authorization – In this step, the client uses to received TGT to request a service ticket from the
TGS (Ticket Granting Server)
Service Request – It is the final step to the achieve security in Hadoop. Then the client uses to
service ticket to authenticate an himself to the server.
Q133) What is commodity hardware?
Answer: Commodity hardware is an low-cost system identified by the less-availability and low-
quality. The commodity hardware for comprises of RAM as it performs an number of services
that require to RAM for the execution. One doesn’t require high-end hardware of configuration
or super computers to run of Hadoop, it can be run on any of commodity hardware.
Q134) How is NFS different from HDFS?
There are a number of the distributed file systems that work in their own way. NFS (Network File
System) is one of the oldest and popular distributed file an storage systems whereas HDFS
(Hadoop Distributed File System) is the recently used and popular one to the handle big data.
Q135) How do Hadoop MapReduce works?
Answer: There are two phases of the MapReduce operation.

Map phase – In this phase, the input data is split by the map tasks. The map tasks run in the
parallel. These split data is used for analysis for purpose.
Reduce phase – In this phase, the similar split data is the aggregated from the entire to collection
and shows the result.[/toggle_content]
Q136) What is MapReduce? What are the syntax you use to run a MapReduce program?
Answer: MapReduce is a programming model in the Hadoop for processing large data sets over
an cluster of the computers, commonly known as the HDFS. It is a parallel to programming
model.
The syntax to run a MapReduce program is the hadoop_jar_file.jar /input_path /output_path.
Q137) What are the different file permissions in the HDFS for files or directory levels?
Answer: Hadoop distributed file system (HDFS) uses an specific permissions model for files and
directories.
1. Following user levels are used in HDFS –
Owner
Group
Others.
2. For each of the user on mentioned above following permissions are applicable –
read (r)
write (w)
execute(x).
3. Above mentioned permissions work on differently for files and directories.
For files
The r permission is for reading an file
The w permission is for writing an file.
For directories
The r permission lists the contents of the specific directory.
The w permission creates or deletes the directory.
The X permission is for accessing the child directory.
Q138) What are the basic parameters of a Mapper?
Answer: The basic parameters of a Mapper is the LongWritable and Text and Int Writable
Q139) How to restart all the daemons in Hadoop?
Answer: To restart all the daemons, it is required to the stop all the daemons first. The Hadoop
directory contains sbin as directory that stores to the script files to stop and start daemons in the
Hadoop.
Use stop daemons command /sbin/stop-all.sh to the stop all the daemons and then use
/sin/start-all.sh
command to start all the daemons again.
Q140) Explain the process that overwrites the replication factors in HDFS?
Answer: There are two methods to the overwrite the replication factors in HDFS –
Method 1: On File Basis
In this method, the replication factor is the changed on the basis of file using to Hadoop FS shell.
The command used for this is:
$hadoop fs – setrep –w2/my/test_file
Here, test_file is the filename that’s replication to factor will be set to 2.
Method 2: On Directory Basis
In this method, the replication factor is changed on the directory basis i.e. the replication factor
for all the files under the given directory is modified.
$hadoop fs –setrep –w5/my/test_dir
Here, test_dir is the name of the directory, then replication factor for the directory and all the
files in it will be set to 5.
Q141) What will happen with a NameNode that doesn’t have any data?
Answer: A NameNode without any for data doesn’t exist in Hadoop. If there is an NameNode, it
will contain the some data in it or it won’t exist.[/toggle_content]
Q142) Explain NameNode recovery process?
Answer: The NameNode recovery process involves to the below-mentioned steps to make for
Hadoop cluster running:
In the first step in the recovery process, file system metadata to replica (FsImage) starts a new
NameNode.
The next step is to configure DataNodes and Clients. These DataNodes and Clients will then
acknowledge of new NameNode.
During the final step, the new NameNode starts serving to the client on the completion of last
checkpoint FsImage for loading and receiving block reports from the DataNodes.
Note: Don’t forget to mention, this NameNode recovery to process consumes an lot of time on
large Hadoop clusters. Thus, it makes routine maintenance to difficult. For this reason, HDFS high
availability architecture is recommended to use.
Q143) How Is Hadoop CLASSPATH essential to start or stop Hadoop daemons?
Answer: CLASSPATH includes necessary directories that contain the jar files to start or stop
Hadoop daemons.
Hence, setting the CLASSPATH is essential to start or stop on Hadoop daemons.
However, setting up CLASSPATH every time its not the standard that we follow. Usually
CLASSPATH is the written inside /etc/hadoop/hadoop-env.sh file. Hence, once we run to Hadoop,
it will load the CLASSPATH is automatically.
Q144) Why is HDFS only suitable for large data sets and not the correct tool to use for many
small files?
Answer: This is due to the performance issue of the NameNode.Usually, NameNode is allocated
with the huge space to store metadata for the large-scale files. The metadata is supposed to be
an from a single file for the optimum space utilization and cost benefit. In case of the small size
files, NameNode does not utilize to the entire space which is a performance optimization for the
issue.
Q145) Why do we need Data Locality in Hadoop?
Answer: Datasets in HDFS store as the blocks in DataNodes the Hadoop cluster. During the
execution of the MapReducejob the individual Mapper processes to the blocks (Input Splits). If
the data does not reside in the same node where the Mapper is the executing the job, the data
needs to be copied from DataNode over the network to mapper DataNode. Now if an
MapReduce job has more than 100 Mapper and each Mapper tries to copy the data from the
other DataNode in cluster simultaneously, it would cause to serious network congestion which is
an big performance issue of the overall for system. Hence, data proximity are the computation is
an effective and cost-effective solution which is the technically termed as Data locality in the
Hadoop. It helps to increase the overall throughput for the system.
Enroll Now!
Q146) What’s Big Big Data or Hooda?
Answer: Only a concept that facilitates handling large data databases. Hadoop has a single
framework for dozens of tools. Hadoop is primarily used for block processing. The difference
between Hadoop, the largest data and open source software, is a unique and basic one.
Q147) Big data is a good life?
Answer: Analysts are increasing demand for industry and large data buildings. Today, many
people are looking to pursue their large data industry by having great data jobs like freshers.
However, the larger data itself is just a huge field, so it’s just Hadoop jobs for freshers
Q148) What is the great life analysis of large data analysis?
Answer: The large data analytics has the highest value for any company, allowing it to make
known decisions and give the edge among the competitors. A larger data career increases the
opportunity to make a crucial decision for a career move.
Q149) Hope is a NoSQL?
Answer: Hadoop is not a type of database, but software software that allows software for
computer software. It is an application of some types, which distributes noSQL databases (such
as HBase), allowing thousands of servers to provide data in lower performance to the rankings
Q150) Need Hodop to Science?
Answer: Data scientists have many technical skills such as Hadoto, NoSQL, Python, Spark, R, Java
and more. … For some people, data scientist must have the ability to manage using Hoodab
alongside a good skill to run statistics against data set.
Q151)What is the difference between large data and large data analysis?
Answer: On the other hand, data analytics analyzes structured or structured data. Although they
have a similar sound, there are no goals. … Great data is a term of very large or complex data
sets that are not enough for traditional data processing applications
Q152) Why should you be a data inspector?
Answer: A data inspector’s task role involves analyzing data collection and using various
statistical techniques. … When a data inspector interviewed for the job role, the candidates must
do everything they can to see their communication skills, analytical skills and problem solving
skills
Q153) Great Data Future?
Answer: Big data refers to the very large and complex data sets for traditional data entry and
data management applications. … Data sets continue to grow and applications are becoming
more and more time-consuming, with large data and large dataprocessing cloud moving more
Q154) What is a data scientist on Facebook?
Answer: This assessment is provided by 85 Facebook data scientist salary report (s) employees or
based on statistical methods. When a factor in bonus and extra compensation, a data scientist
on Facebook expected an average of $ 143,000 in salary
Q155) Can Hedop Transfer?
Answer: HODOOP is not just enough to replace RDGMS, but it is not really what you want to do.
… Although it has many advantages to the source data fields, Hadoopcannot (and usually does)
replace a data warehouse. When associated with related databases. However, this creates a
powerful and versatile solution.
Get Big Data Hadoop Course Now!
Q156) What’s happening in Hadoop?
Answer: MapReduce is widely used in I / O forms, a sequence file is a flat file containing binary
key / value pairs. Graphical publications are stored locally in sequencer. It provides Reader,
Writer and Seater classes. The three series file formats are:
Non-stick key / value logs.
Record key / value records are compressed – only ‘values’ are compressed here.
Pressing keys / value records – ‘Volumes’ are collected separately and shortened by keys and
values. The ‘volume’ size can be configured.
Q157) What is the Work Tracker role in Huda?
The task tracker’s primary function, resource management (managing work supervisors),
resource availability and monitoring of the work cycle (monitoring of docs improvement and
wrong tolerance).
This is a process that runs on a separate terminal, not often in a data connection.
The tracker communicates with the label to identify the location of the data.
The best mission to run tasks at the given nodes is to find the tracker nodes.
Track personal work trackers and submit the overall job back to the customer.
MapReduce works loads from the slush terminal.
Q158) What is the RecordReader application in Hutch?
Answer: Since the Hadoop data separates various blocks, recordReader is used to read split data
in a single version. For example, if our input data is broken:
Row1: Welcome
Row2: Intellipaat
It uses “Welcome to Intellipaat” using RecordReader.
Q159)What is Special Execution in Hooda?
Answer: A range of Hadoop, some sloping nodes, are available to the program by distributing
tasks at many ends. Tehre is a variety of causes because the tasks are slow, which are sometimes
easier to detect. Instead of identifying and repairing slow-paced tasks, Hopep is trying to find out
more slowly than he expected, then backs up the other equivalent task. Hadoop is the insulation
of this backup machine spectrum.
This creates a simulated task on another disk. You can activate the same input multiple times in
parallel. After most work in a job, the rest of the functions that are free for the time available are
the remaining jobs (slowly) copy copy of the splash execution system. When these tasks end, it is
reported to JobTracker. If other copies are encouraging, Hudhoft dismays the tasktakers and
dismiss the output.
Hoodab is a normal natural process. To disable, set mapred.map.tasks.speculative.execution and

mapred.reduce.tasks.speculative.execution Invalid job options
Q160) What happens if you run a hood job?
Answer: It will throw an exception that the output file directory already exists.
To run MapReduce task, you need to make sure you do not have a pre-release directory in HDFS.
To delete the directory before you can work, you can use the shell: Hadoop fs -rmr / path / to /
your / output / or via Java API: FileSystem.getlocal (conf) .delete (outputDir, true);
Get Hadoop Course Now!
Q161) How can you adjust the Hadoopo code?
Answer:
First, check the list of currently running MapReduce jobs. Next, we need to see the orphanage
running; If yes, then you have to determine the location of the RM records.
Run: “ps -ef | grep -I ResourceManager”

And search result log in result displayed. Check the job-id from the displayed list and check if
there is any error message associated with the job.
Based on RM records, identify the employee tip involved in executing the task.
Now, log on to that end and run – “ps -ef | grep -iNodeManager”
Check the tip manager registration. Major errors reduce work from user level posts for each
diagram.
Q162) How should the reflection factor in FFAS be constructed?
Answer:
Hdfs-site.xml is used to configure HDFS. Changing the dfs.replication property on hdfs-site.xml

will change the default response to all files in HDFS.
You can change the reflection factor based on a file you are using
Hadoop FS shell: [training @ localhost ~] $ hadoopfs -setrep -w 3 / n / fileConversely,
You can also change the reflection factors of all the files under a single file.
[Training @ localhotel ~] $ hadoopfs-setrep -w 3 -R / my / dir
Now go through the Hadoop administrative practice to learn about the reflection factor in HDFS!
Q163) How to control the release of the mapper, but does the release issue not?
Answer:
To achieve this summary, you must set:
conf.set (“mapreduce.map.output.compress”, true)
conf.set (“mapreduce.output.fileoutputformat.compress”, incorrect)
Q164) Which companies use a hoop?
Answer: Learn how Big Data and HADOOP have changed the rules of the game in this blog post.
Yahoo (the largest contribution to the creation of the hawkop) – Yahoo search engine created for
Hadoop, Facebook – Analytics, Amazon, Netflix, Adobe, Ebay, Spadys, Twitter, Adobe.
Q165) Do I have to know Java to learn the habit?
Answer: The ability of MapReduce in Java is an additional plus but not needed. … learn the
Hadoop and create an excellent business with Hadoo, knowing basic basic knowledge of Linux
and Java Basic Programming Policies
Q166) What should you consider when using the second name line?
Answer: Secondary mode should always be used on a separate separate computer.
This prevents intermittent interaction with the mainstream.
Q167) Name the Hadoop code as executable modes.
Answer: There are various methods to run the Hadoop code –
Fully distributed method
Pseudosiphrit method
Complete mode
Q168)Name the operating system supported by the hadoop operation.
Answer: Linux is the main operating system. However, it is also used as an electric power
Windows operating system with some additional software.
Q169) HDFS is used for applications with large data sets, not why Many small files?
Answer: HDFS is more efficient for a large number of data sets, maintained in a file
Compared to smaller particles of data stored in multiple files. Saving NameNode
The file system metadata in RAM, the amount of memory that defines the number of files in the
HDFS file
System. In simpler terms, more files will generate more metadata, which means more
Memory (RAM). It is recommended that you take 150 bytes of a block, file or directory
metadata.
Q170) What are the main features of hdfssite.xml?
Answer: There are three important properties of hdfssite.xml:
data.dr – Identify the location of the data storage.
name.dr – Specify the location of the metadata storage and specify the DFS is located
On disk or remote location.

checkpoint.dir – for the second name name.
Q171) What are the essential hooping tools that improve performance? Big data?
Answer: Some of the essential hoopoe tools that enhance large data performance –
Hive, HDFS, HBase, Avro, SQL, NoSQL, Oozie, Clouds, Flume, SolrSee / Lucene, and ZooKeeper
Q172) What do you know about Fillil soon?
Answer: The sequence is defined as a flat file containing the binary key or value pairs. This is
important
Used in MapReduce’s input / output format. Graphical publications are stored locally
SequenceFile.
Several forms of sequence –
Summary of record key / value records – In this format, the values are compressed.
Block compressed key / value records – In this format, the values and keys are individually
The blocks are stored and then shortened.
Sticky Key / Value Entries – In this format, there are no values or keys.
Get 100% Placement Oriented Training in Hadoop!
Q173) Explain the work tracker’s functions.
Answer: In Hadoop, the work tracker’s performers perform various functions, such as –
It manages resources, manages resources and manages life cycle Tasks.
It is responsible for finding the location of the data by contacting the name Node.
It performs tasks at the given nodes by finding the best worker tracker.
Work Tracker Manages to monitor all task audits individually and then submit The overall job for
the customer.
It is responsible for supervising local servicemen from Macpute’s workplace Node.
Q174) The FASAL is different from NAS?
Answer: The following points distinguish HDFS from NAS –
Hadoop shared file system (HDFS) is a distributed file system that uses data
Network Attached Storage (NAS) is a file-wide server
Data storage is connected to the computer network.
HDFS distributes all databases in a distributed manner
As a cluster, NAS saves data on dedicated hardware.
HDFS makes it invaluable when using NAS using materials hardware
Data stored on highhend devices that include high spending
The HDFS work with MapReduce diagram does not work with MapReduce
Data and calculation are stored separately.
Q175)Does the HDFS go wrong? If so, how?
Answer: Yes, HDFS is very mistaken. Whenever some data is stored in HDFS, name it Copying
data (copies) to multiple databases. Normal reflection factor is 3. It needs to be changed
according to your needs. If DataNode goes down, NameNode will take Copies the data from
copies and copies it to another node, thus making the data available automatically. TheThe way,
as the HDFS is the wrong tolerance feature and the fault tolerance
Q176) Distinguish HDFS Block and Input Unit.
Answer: The main difference between HDFS Block and Input Split is HDFS Black. While the
precise section refers to the input sector, the business section of the data is knownData. For
processing, HDFS first divides the data into blocks, and then stores all the packages Together,
when MapReduce divides the data into the first input section then allocate this input and divide
it Mapper function.
Q177) What happens when two clients try to access the same file on HDFS?
Answer: Remember that HDFS supports specific characters Only at a time).
NName client nameNode is the nameNode that gives the name Node Lease the client to create
this file. When the second client sends the request to open the same file To write, the lease for
those files is already supplied to another customer, and the name of the name Reject second
customer request.
Q178) What is the module in HDFS?
Answer: The location for a hard drive or a hard drive to store data As the volume. Store data
blocks in HDFS, and then distributed via the hoodo cluster. The entire file is divided into the first
blocks and stored as separate units.
Q179) What is Apache?
Answer: YARN still has another resource negotiation. This is a hoodup cluster Management
system. It is also the next generation introduced by MapReduce and Hoodab 2 Account
Management and Housing Management Resource Management. It helps to further support the
hoodoop Different processing approaches and wide-ranging applications.
Q180) What is the terminal manager?
Answer: Node Manager is TARStracker’s YARN equivalent. It takes steps from it Manages
resourceManager and single-source resources. This is the responsibility Containers and
ResourceManager monitor and report their resource usage. Each Single container processes
operated at slavery pad are initially provided, monitored and tracked By the tip manager
associated with the slave terminal.
Q181) What is the recording of the Hope?
Answer: In Hadoop, RecordReader is used to read a single log split data. This is important
Combining data, Hatopo divides data into various editions. For example, if input data is
separated
Row1: Welcome
Line 2: The Hoodah’s World
Using RecordReader, it should be read as “Welcome to the Hope World”.
Q182) Shorten up the mappers do not affect the Output release?
Answer:
In order to minimize the output of the maple, the output will not be affected and set as follows:
Conf.set (“mapreduce.map.output.compress”, true)
Conf.set (“mapreduce.output.fileoutputformat.compress”, incorrect)
Q183) A Reducer explain different methods.

Answer:
Various methods of a Reducer include:
System () – It is used to configure various parameters such as input data size.
Syntax: general vacuum system (environment)
Cleaning () – It is used to clean all temporary files at the end of the task.
Syntax: General Vacuum Cleanup (Eco)
Reduce () – This method is known in the heart of Rezar. This is used once
A key to the underlying work involved.
Syntax: reduce general void (key, value, environment)
Q184) How can you configure the response factor in the HDFL?
Answer: For the configuration of HDFS, the hdfssite.xml file is used. Change the default value The
reflection factor for all the files stored in HDFS is transferred to the following asset hdfssite.xml
dfs.replication
Q185) What is the use of the “jps” command?
Answer: The “Jps” command is used to verify that the Hadoop daemons state is running. TheList
all hadoop domains running in the command line. Namenode, nodemanager, resource manager,
data node etc
Q186) What is the next step after Mapper or Mumpask?:

Answer: The output of the map is sorted and the partitions for the output will be created. The
number of partitions depends on the number of disadvantages.
Q187) How do we go to the main control for a certain reduction?
Answer: Any Reducer can control the keys (through which posts) by activating the custom
partition.
Q188) What is the use of the coordinator?
Answer: It can be specified by Job.setCombinerClass (ClassName) to make local integration with

a custom component or class, and intermediate outputs, which helps reduce the size of the
transfers from the Mapper to Reducer.
Q189) How many maps are there in specific jobs?
Answer: The number of maps is usually driven by total inputs, that is, the total volume of input
files.
Usually it has a node for 10-100 maps. The work system takes some time, so it is best to take at
least a minute to run maps. If you expect 10TB input data and have a 128MB volume, you will
end up with 82,000 maps, which you can control the volume of the mapreduce.job.maps
parameter (this only provides a note structure). In the end, the number of tasks are limited by
the number of divisions returned by the InputFormat.getSplits () over time (you can overwrite).
Q190) What is the use of defect?
Answer: Reducer reduces the set of intermediate values, which shares one key (usually smaller)
values. The number of job cuts is set by Job.setNumReduceTasks (int).
Q191) Explain Core modalities of deficiency?
Answer: The Reducer API is similar to a Mapper, a run () method, which modes the structure of
the work and the reconfiguration of the reconfiguration framework from reuse. Run () method
once (), minimize each key associated with the task to reduce (once), and finally clean up the
system. Each of these methods can be accessed using the context structure of the task using
Context.getConfiguration ().
As for the mapper type, these methods may be violated with any or all custom processes. If
none of these methods are violated, the default reduction action is a symbolic function; Values
go further without processing. Reducer heart is its reduction (method). This is called a one-time
one; The second argument is Iterable, which provides all the key related values.
Q192) What are the early stages of deficiency?
Answer: Shake, sort and lower.
Q193) Shuffle’s explanation?
Answer: Reducer is a sorted output of input mappers. At this point, the configuration receives a
partition associated with the output of all the mappers via HTTP.
Q194) Explain the Reducer’s Line Stage?
Answer: Structured groups at this point are Reducer entries with the keys (because different
movers may have the same key output). Mixed and sequence phases occur simultaneously; They
are combined when drawing graphic outputs (which are similar to the one-sequence).
Q195) Explain Criticism?
Answer: At this point the reduction (MapOutKeyType, Iterable, environment) method is grouped
into groups for each group. Reduction work output is typically written to FileSystem via
Context.write (ReduceOutKeyType, ReduceOutValType). Applications can use application
progress status, set up application level status messages, counters can update, or mark their
existence. Reducer output is not sorted.

Hadoop Interview 1

Încărcat de

Informații document

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

Hadoop Interview 1

Încărcat de

Drepturi de autor:

Formate disponibile

A Hadoop cluster is a collection of nodes.

Q1) What is Big Data?

Q2) Why Big Data?

Q3) What are 3 core dimension of Big Data.

Answer: Big Data have 3 core dimensions:

Q4) Role of Volume in Big Data

Q5) Role of variety in Big Data

Q6) Role of Velocity in Big Data

Q8) Role of Veracity in Big Data

Q9) Role of Value in Big Data

Q10) What is Hadoop?

Q11) Why Hadoop?

Q12) Connection between Hadoop and Big Data

Q13) Hadoop and Hadoop Ecosystem

Q14) What is HDFS.

Q15) Why HDFS?

Q16) What is YARN

Q17) Use of YARN.

Q18) What is MapReduce?

Answer: MapReduce is a programming approach to process our data. MapReduce is use to

Q20) What is Pig?

Q21) Use of Pig

Q22) What is Pig Latin

Q23) What is Hive?

Q24) Use of Hive

Q25) What is HQL?

Q27) Use of Sqoop?

Q28) What are other components of Hadoop Ecosystem?

Answer: Below are other components of Hadoop Ecosystem:

Q29) Difference Between Hadoop and HDFS

Q30) How to access HDFS

Answer: below is command:

hdfs fs or hdfs dfs

Q31) How to create directory in HDFS

Answer: below is command:

Q32) How to keep files in HDFS

Answer: below is command:

hdfs fs -put <source_file_path> <destination_file_path>

hdfs fs -copyfromLocal <source_file_path> <destination_file_path>

Q33) How to copy file from HDFS to local

Answer: below is command:

hdfs fs -copyToLocal <source_file_path> <destination_file_path>

Q34) How to Delete directory from HDFS

Answer: below is command:

hdfs fs -rm <dir_name>

Q35) How to Delete file from HDFS

Answer: below is command:

Become an Big Data Hadoop Certified Expert in 25Hours

Q36) How to Delete directory and files recursively from HDFS

Answer: below is command:

hdfs fs -rm -r <file_path>

Q37) How to read file in HDFS

Answer: below is command:

hdfs fs -cat <file_path?

Q38) What are the other file system available in market?

Q40) What is data ingestion?

Q42) What is data processing in big data?

Q43) What is unstructured data?

Q45) What are the most important qualities of Hadoop?

Q46) What do you mean by Open Source?

Q47) What do you mean by Distributed processing?

Q48) What is Fault tolerance in Hadoop?

Q49) What is High availability in Hadoop?

Q50) What is replication factor in Hadoop?