
Q1. What is Big Data? Why it is important? Explain its Characteristics.

Due to the advent of new technologies, devices, and communication means like social
networking sites, the amount of data produced by mankind is growing rapidly every year.
This rate is still growing enormously. Though all this information produced is meaningful
and can be useful when processed, it is being neglected.
Big data is a term that describes the large volume of data, both structured and
unstructured, that inundates a business on a day-to-day basis. But it's not the amount of data
that's important; it's what organizations do with the data that matters. Big data can be
analysed for insights that lead to better decisions and strategic business moves.
Big data is a collection of large datasets that cannot be processed using traditional
computing techniques.
What Comes Under Big Data?
Big data involves the data produced by different devices and applications.
Black Box Data :
It is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight
crew, recordings of microphones and earphones, and the performance information of the
aircraft.
Social Media Data :
Social media such as Facebook and Twitter hold information and the views
posted by millions of people across the globe.
Stock Exchange Data:
The stock exchange data holds information about the buy and sell decisions
made by customers on shares of different companies.
Power Grid Data:
The power grid data holds information about the power consumed by a particular node
with respect to a base station.
Transport Data:
Transport data includes model, capacity, distance and availability of a vehicle.
Search Engine Data :
Search engines retrieve lots of data from different databases.

Why is big data analytics important?
Big data analytics helps organizations harness their data and use it to identify new
opportunities.
1. Cost reduction.
Big data technologies such as Hadoop and cloud-based analytics bring
significant cost advantages when it comes to storing large amounts of data plus they
can identify more efficient ways of doing business.
2. Faster, better decision making.
With the speed of Hadoop and in-memory analytics, combined with the ability
to analyze new sources of data, businesses are able to analyze information
immediately and make decisions based on what they've learned.
3. New products and services.
With the ability to gauge customer needs and satisfaction through analytics
comes the power to give customers what they want. Davenport points out that with
big data analytics, more companies are creating new products to meet customers'
needs.

Characteristics of Big Data:


1) Volume (amount of data, i.e. the size of the data set)
Volume refers to the vast amounts of data generated every second. This makes most
data sets too large to store and analyse using traditional database technology. New big data
tools use distributed systems so that we can store and analyse data across databases that are
dotted around anywhere in the world.
Nowadays, with decreasing storage costs, better storage solutions like Hadoop, and
algorithms that create meaning from all that data, volume in itself is not a problem at all.

2) Velocity (speed of data in and out, or data in motion)

Velocity refers to the speed at which new data is generated and the speed at which
data moves around. Technology now allows us to analyse the data while it is being generated
(sometimes referred to as in-memory analytics), without ever putting it into databases.
Velocity is the speed at which the data is created, stored, analysed and visualized. The
challenge organizations have is to cope with the enormous speed at which data is created and
used in real time.

3) Variety (range of data types, domains and sources)
Variety refers to the different types of data we can now use. With big data
technology we can now analyse and bring together data of different types, such as messages,
social media conversations, photos, sensor data, and video or voice recordings.
There are many different types of data, and each type requires a different kind of analysis
or different tools.

Q2. Explain the different big data types with suitable example.

Being able to manage the variety of data types is important because big data
encompasses everything from dollar transactions to tweets to images to audio. Therefore,
taking advantage of big data requires that all this information be integrated for analysis and
data management.
The three main types of big data:-
1. Structured data
2. Unstructured data
3. Semi-structured data

1. Structured data :-
The term structured data generally refers to data that has a defined length and format
and defined repeating patterns; it is organized data in a predefined format.
This kind of data accounts for about 20% of the data that is out there. It is usually
stored in a database.
For ex:-
Structured data includes numbers, dates, and groups of words and numbers called
strings.
Relational databases (in the form of tables)
Flat files in the form of records (like tab-separated files)
Multidimensional databases
Legacy databases.

The sources of data are divided into two categories:-
1. Computer or machine generated:-
Machine-generated data generally refers to data that is created by a machine without
human intervention.
Ex: - Sensor data, web log data, financial data.
2. Human-generated:-
This is data that humans, in interaction with computers, supply.
Ex: - Input data, Gaming-related data.

Unstructured data:-
Unstructured data is data that does not follow a specified format. Unstructured data
refers to information that either does not have a pre-defined data model or is not organized in
a predefined manner.
Unstructured information is typically text-heavy, but may also contain data such as dates,
numbers and facts. 80% of business relevant information originates in unstructured form,
primarily text.
Sources:-
Social media:- YouTube, Twitter, Facebook
Mobile data:- Text messages and location information.
Call center notes, e-mails, written comments in a survey, blog entries.
Mainly two types:-
1. Machine generated :-
It refers to data that is created by a machine without human intervention.
Ex:-Satellite images, scientific data, photographs and video.
2. Human generated :-
It is generated by humans, in interaction with computers, machines etc.
Ex:-Website content, text messages.

Semi-structured data:-
Semi-structured data is data that has not been organized into a specialized repository
such as a database, but that nevertheless has associated information, such as metadata, that
makes it more amenable to processing than raw data.

Schema-less or self-describing structure refers to a form of structured data that
contains tags or mark-up elements in order to separate elements and generate hierarchies of
records and fields in the given data. Semi-structured data is a kind of data that falls between
structured and unstructured data.
Sources:-
File systems such as web data in the form of cookies.
Web server log and search patterns.
Sensor data.

Q3. Write down the difference between traditional Analytics & Big Data
Analytics.

1. Big data analytics: The types of data used are structured, unstructured and semi-structured.
   Traditional analytics: Uses structured data that is formatted in rows and columns.
2. Big data analytics: The volume of data processed is very large, i.e. 100 terabytes to zettabytes.
   Traditional analytics: The volume of data processed is tens of terabytes or less.
3. Big data analytics: There is a continuous flow of data.
   Traditional analytics: There is no continuous flow of data.
4. Big data analytics: Decision support is based on real-time data and gives valuable business information and insights.
   Traditional analytics: Decision support is based on historical data.
5. Big data analytics: Data is collected for analysis from inside or outside the organization, for example device data, social media, sensor data, and log data.
   Traditional analytics: Data is collected for analysis from internal sources, such as records from transactional systems.
6. Big data analytics: It uses inexpensive commodity boxes as cluster nodes.
   Traditional analytics: It uses specialized high-end hardware and software.
7. Big data analytics: Data is often physically distributed.
   Traditional analytics: All data is centralized.
8. Big data analytics: It contains massive or voluminous data, which increases the difficulty of figuring out the relationships between data items.
   Traditional analytics: The relationships between data items can be explored easily, as the amount of information stored is small.
9. Big data analytics: Data is stored in HDFS or NoSQL stores.
   Traditional analytics: Data is stored in an RDBMS.

Q4. What is Big Data Analytics? Explain its different types.

The different types of big data analytics are as follows:


1. Descriptive analytics
2. Predictive analytics
3. Prescriptive analytics
4. Diagnostic analytics
Descriptive Analytics:
It describes what is happening now, based on incoming data.
To mine the analytics, you typically use a real-time dashboard or email reports.
It is also referred to as data mining.
Descriptive analytics sit at the bottom of the big-data value chain, but they can be valuable for
uncovering patterns that offer insight.
A simple example of descriptive analytics would be assessing credit risk, using past
financial performance to estimate a customer's likely financial performance.
Descriptive analytics can be useful in the sales cycle, for example, to categorize
customers by their likely product preferences and sales cycle.
It is the most time-intensive type and often produces the least value.
Features:
1. Backward looking
2. Focused on descriptions and comparisons
3. Pattern detection and descriptions
4. MECE (mutually exclusive and collectively exhaustive) categorization
5. Category development based on similarities and differences (segmentation)

Predictive Analytics:
This type of analysis deals with likely scenarios of what might happen.
The deliverables are usually a predictive forecast.
Predictive analytics use big data to identify past patterns and predict the future.
For example, some companies are using predictive analytics for sales lead scoring.
Some companies have gone one step further and use predictive analytics for the entire sales
process, drawing on social media, documents, CRM data, etc.

Properly tuned predictive analytics can be used to support sales, marketing, or for other
types of complex forecasts.
Predictive analytics include next best offers, churn risk and renewal risk analysis.
Features:
1. Forward looking
2. Focused on non-discrete predictions of future states, relationship, and patterns
3. Description of prediction result set probability distributions and likelihoods
4. Model application
5. Non-discrete forecasting (forecasts communicated in probability distributions)

Prescriptive analytics:
This type of analysis reveals what actions should be taken.
This is the most valuable kind of analysis and usually results in rules and
recommendations for the next step.
This analysis is really valuable, but it is not widely used:
13% of organizations are using predictive analysis of big data and only 3% are using
prescriptive analysis of big data.
It gives you a laser-like focus on a particular question.
It shows the best solution among a variety of choices, given the known parameters, and
suggests options for how to take advantage of a future opportunity or mitigate a future risk.
Features:
1. Forward looking
2. Focused on optimal decisions for future situations
3. Simple rules to complex models that are applied on an automated or programmatic
basis
4. Optimization and decision rules for future events

Diagnostic analytics:
It gives a look at past performance to determine what happened and why.
The result of the analysis is often an analytical dashboard.
They are used for discovery or to determine why something happened in the first place.
For example, for a social media marketing campaign, you can use descriptive analytics to
assess the number of posts, mentions, followers, fans, page views, reviews, pins, etc.

There can be thousands of online mentions that can be distilled into a single view to see
what worked in your past campaigns and what didn't.
Features:
1. Backward looking and Focused on causal relationships and sequences
2. Target/dependent variable with independent variables/dimensions

Q5. Enlist and explain the different technologies used for handling Big Data.

1. Distributed and parallel computing for big data.


2. Hadoop
3. Cloud computing and big data.
4. In-memory computing technology for big data

1. Distributed and parallel computing for big data


In distributed computing, multiple computing resources are connected in a network and
computing tasks are distributed across these resources. This sharing of tasks increases the
speed as well as the efficiency of the system.
Distributed computing is considered faster and more efficient than traditional
methods of computing. It is also more suitable for processing huge amounts of data in a
limited time.
A system that divides a computation into subtasks, which are handled individually by
processing units running in parallel, is called a parallel system.
The growing competition in the market and the astronomical increase in the volume,
velocity, variety, and veracity of data collected from different sources are forcing
organizations to analyse the entire data in a very short time.
Techniques of Parallel computing
Cluster or Grid computing
Description: Cluster or grid computing is based on a connection of multiple servers in a network.
Uses: Clusters can be created even by using hardware components that were acquired a long time back, to provide cost-effective storage options.

High Performance Computing (HPC)
Description: HPC environments are known to offer high performance and scalability by using IMC.
Uses: HPC environments can be used to develop specialty and custom applications for research and business organizations.

Difference between Distributed and parallel system


Distributed system: An independent, autonomous set of systems connected in a network for accomplishing specific tasks.
Parallel system: A computer system with several processing units attached to it.

Distributed system: Coordination is possible between connected computers, each of which has its own memory and CPU.
Parallel system: A common shared memory can be directly accessed by every processing unit in the network.

Distributed system: Loose coupling of computers connected in a network, providing access to data and remotely located resources.
Parallel system: Tight coupling of processing resources that are used for solving a single complex problem.

Hadoop
This technology develops open source software for reliable, scalable, distributed
computation.
The Apache Hadoop software library is a framework that allows for the distributed
processing of large datasets across clusters of computers using simple programming
models.
It is used to distribute the processing of big data datasets using the MapReduce programming
model.
All the modules in Hadoop are designed with the fundamental assumption that hardware
failures are a common occurrence and should be automatically handled by the framework.
Hadoop performs well with several nodes without requiring shared memory or disks
among them.
Hadoop follows a client-server architecture in which the server works as a master and is
responsible for data distribution among clients, which are commodity machines that work as
slaves to carry out all the computational tasks. The master node also performs the tasks of
job control, disk management and work allocation.

Module In Hadoop : -
1. Hadoop Common :- contains libraries and utilities needed by other Hadoop modules.
2. Hadoop Distributed File System (HDFS) :- A distributed file system that stores data
on commodity machines.
3. Hadoop YARN :- A platform responsible for managing computing resources and
user applications.
4. Hadoop MapReduce :- An implementation of the MapReduce programming model for
large-scale data processing.

Cloud computing and Big data

Cloud computing provides shared resources that comprise applications, storage
solutions, computational units, network solutions, development and deployment platforms,
business processes, etc.
A cloud computing environment saves infrastructure-related costs in an organization
by providing a framework that can be optimized and expanded horizontally.
On a cloud-based platform, applications can easily obtain the resources to perform
computational tasks. The cost of acquiring these resources is paid as per the
acquired resources and their use.
Cloud computing techniques use data centers to collect data and ensure that data
backup and recovery are automatically performed to cater to the requirements of the
business community. Both cloud computing and Big Data analytics use the distributed
computing model in a similar manner.

Feature of cloud computing


Scalability :- Scaling up and scaling down at any level is possible.
Elasticity :- It means hiring certain resources, as and when required, and paying for
the resources that have been used.
Resource pooling :- Sharing of resources is allowed in a cloud, which facilitates cost
cutting through resource pooling. This is an important aspect of cloud services for Big
Data analytics.
Self service :- Cloud computing provides a simple user interface that helps users
directly access the services they want.

Cloud Delivery model for Big data
IaaS (Infrastructure as a Service) :- The huge storage and computational power
requirements for big data are fulfilled by the limitless storage space and computing ability
obtained from the IaaS cloud.
PaaS (Platform as a Service) :- PaaS offerings of various vendors have started adding
popular Big Data platforms such as MapReduce and Hadoop. These offerings
save organizations from a lot of hassles that may occur in managing individual
hardware components and software applications.
SaaS (Software as a Service) :- An organization may require identifying and analyzing the
voice of the customer, particularly on social media platforms. The social media data and the
platform for analyzing the data are provided by SaaS vendors.

In-memory computing technology for big data (IMC)


IMC is used to facilitate high-speed data processing; for example, IMC can help in
tracking and monitoring consumer activities and behavior.
In IMC technology, the RAM is used for analyzing data, which helps to increase the
computing speed.
The data is stored in primary memory, so processing happens rapidly and
analysis of the data can be carried out in a quicker and more efficient manner.
IMC helps different departments or business units of an organization to access and
process the data that is relevant to them. This reduces the load on the central warehouse,
as every department takes care of processing its own data.
Many technology companies are making use of this technology. For example, the in-
memory computing technology developed by SAP, called HANA (High-performance
Analytic Appliance), uses sophisticated data compression to store data
in random access memory. HANA's performance is 10,000 times faster when
compared to standard disks, which allows companies to analyze data in a matter of
seconds instead of long hours.
Some of the advantages of in-memory computing include:
The ability to cache countless amounts of data constantly. This ensures extremely fast
response times for searches.
The ability to store session data, allowing for the customization of live sessions and
ensuring optimum website performance.
The ability to process events for improved complex event processing

Q6. Explain different Big Data Business models.

Data as a Service (DaaS)


DaaS hinges on a value proposition of supplying large amounts of processed data,
with the idea that the customer's job-to-be-done is to find answers or develop
solutions for their customers.
For CPG companies partnering with large retailers as a trusted supplier, this usually
begins with the POS/inventory data supplied at the vendor level or, where
appropriate, the category level.
The granularity of data can be daily or weekly, with historical data usually covering
104 weeks. While the author speaks in general terms about marketing data to
monetize it (in fact the entire article has an eye toward this), CPG companies cannot
sell retailer-supplied data. This does not mean that you, as a CPG category or sales
manager, cannot monetize the data. For you, monetization occurs when you use
analytics to gain insights to share with your buyer(s). The goal of this activity, of
course, is to flank your competitors within the category, increasing your brand's
market share within the retailer ecosystem.

Information as a Service (IaaS)

IaaS focuses on providing insights based on the analysis of processed data. In this
case the customer's job-to-be-done is more about coming up with their own
conclusions or even selling an idea based on certain information.
Additionally, IaaS customers don't want to, or do not have the resources to, process
and analyze data. Rather, they are willing to exchange value for analysis from trusted
parties.
Unlike the DaaS business model, which is about aggregation and dissemination of lots
of processed data for customers to create their own value propositions from, the IaaS
business model is all about turning data into information for customers who need
something more tailored and are willing to pay for it.

Answers as a Service (AaaS)

AaaS is focused on providing higher-level answers to specific questions, rather than
simply the information that can be used to come up with an answer. CPG companies
who implement the AaaS business model do so to get answers to specific
questions.
This business model, as you might guess, is the top of the pyramid when it comes to big
data. The key with this business model is that, given the CPG company's ability to
create real, trusted value in the answers it provides to buyers, buyers take note and
value the insightful answers provided.

Q7. What is Hadoop? Explain Hadoop system principles.

1. Hadoop is a free, Java-based programming framework that supports the processing of


large data sets in a distributed computing environment.
2. Hadoop makes it possible to run applications on systems with thousands of nodes
involving thousands of terabytes.
3. Its distributed file system facilitates rapid data transfer rates among nodes and allows the
system to continue operating uninterrupted in case of a node failure.

4. This approach lowers the risk of catastrophic system failure, even if a significant number
of nodes become inoperative.
5. Hadoop was inspired by Google's MapReduce, a software framework in which an
application is broken down into numerous small parts. Any of these parts (also called
fragments or blocks) can be run on any node in the cluster.
6. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy
elephant.
7. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, the
Hadoop distributed file system (HDFS) and a number of related projects such as Apache
Hive, HBase and Zookeeper.
8. The Hadoop framework is used by major players including Google, Yahoo and IBM,
largely for applications involving search engines and advertising.

HADOOP SYSTEM PRINCIPLE


1. Scale-Out rather than Scale-Up
2. Bring code to data rather than data to code
3. Deal with failures they are common
4. Abstract complexity of distributed and concurrent applications

1. Scale-Out rather than Scale-Up


It is harder and more expensive to scale-up
Add additional resources to an existing node (CPU, RAM)
New units must be purchased if required resources cannot be added
Also known as scaling vertically
Scale-Out
Add more nodes/machines to an existing distributed application
Software Layer is designed for node additions or removal
Hadoop takes this approach - A set of nodes are bonded together as a single
distributed system
Very easy to scale down as well

2. Code to data
Traditional data processing architecture
Nodes are broken up into separate processing and storage nodes connected by a high-
capacity link.
Many data-intensive applications are not CPU-demanding, so the network becomes the bottleneck.

Hadoop co-locates processors and storage


Code is moved to data (size is tiny, usually in KBs)
Processors execute code and access underlying local storage.

3. Failures are Common


Given a large number of machines, failures are common
Large warehouses may see machine failures weekly or even daily
Hadoop is designed to cope with node failures
Data is replicated
Tasks are retried

4. Abstract Complexity
Hadoop abstracts many complexities in distributed and concurrent applications
Defines small number of components
Provides simple and well defined interfaces of interactions between these components
Frees developer from worrying about system level challenges
Race conditions, data starvation
Processing pipelines, data partitioning, code distribution etc.
Allows developers to focus on application development and business logic

Q8. Write a note on the different problems for which Hadoop is suitable.

1. Modelling True Risk


If you think about this in the context of banks or other financial institutions (which is,
well, banks) this is a really useful way of burrowing deeper into your customers. You can
suck in data about their spending habits, their credit, repayments, everything. Munge it all
together and squeeze out an answer on whether to lend them more money.
2. Customer Churn Analysis
Hadoop was used here to analyse how a telco retained customers. Again, data from
many different sources, including social networks AND the calls themselves (recorded and
then voice analysed, I guess), were used to work out how and why the company was losing
or gaining customers.
3. Recommendation engines
Thinking about this in terms of Google, this is like the ranking algorithm: sucking in
a bunch of factors like popularity, link depth, buzz on Twitter etc. and then scoring links for
display in score order later.
4. Ad Targeting
Similar to the recommendation engine, but with the added dimension of the advertiser
paying a premium for better ad-space
5. Point Of Sale Transaction Analysis
On this face of it, this seems simple and straightforward; analysing the data that is
provided by your P.O.S device. However, this could also include other factors like weather
and local news, which could influence how and why consumers spend money in your store.
6. Analysing Network Data To Predict Failure
The example given here was that of an electricity company which used smart-
somethings to measure the electricity flying around their network. They could pump in past
failures and current fluctuations and then pass the whole lot into a modelling engine to
predict where failures would occur. It turned out that seemingly unconnected, small
anomalies on the system were connected after all. This data wouldn't have been able to be
mined any other way.
7. Threat Analysis/Fraud Detection
Another one for the financial sector and very similar to Modelling True Risk. Hadoop
can be used to analyse spending habits, earnings and all sorts of other key metrics to work out
whether a transaction is fraudulent. Yahoo! uses Hadoop with this pattern to ascertain whether a certain
piece of mail heading into Yahoo! Mail is actually spam.
8. Trade Surveillance
Similar to Threat Analysis and Fraud Detection, but this time pointed squarely at the
markets, analysing gathered historical and current live data to see if there is Inside Trading or
Money Laundering afoot!
9. Search Quality
Similar to the recommendation engine. This will analyse search attempts and then try
to offer alternatives, based on data gathered and pumped into Hadoop about the links and the
things people search for.
10. Data Sandbox
This is probably the most ambiguous, but the most useful, Hadoop-able problem. A
data sandbox is just somewhere to dump data that you previously thought was too big, or
useless or disparate to get any meaningful data from. Instead of just chucking it away, throw
it into Hadoop (which can easily handle it) then see if there IS data you can glean from it. It's
cheap to run Hadoop and anyone can attach a datasource and push data in. It allows you to
make otherwise arbitrary queries about stuff to see if it's any use!

Aggregate Data, Score Data, Present Score As Rank, which, at its simplest, is what
Hadoop can do. But the introduction of the idea of a Data Sandbox and the ability, using
Sqoop, to push the analysed data back into a relational database (for a data warehouse for
example) means that you can run Hadoop independently and prove its worth in your
business very cheaply.

Q9. Compare the RDBMS and Hadoop.



Basic Description
RDBMS: Traditional row-column databases used for transactional systems, reporting, and archiving.
Hadoop: An open source approach to storing data in a file system across a range of commodity hardware and processing it utilizing parallelism (multiple systems at once).

Manufacturers
RDBMS: SQL Server, MySQL, Oracle, etc.
Hadoop: Hadoop implementations by Cloudera, Intel, Amazon, Hortonworks.

Best Application
RDBMS: Reads and writes on a reasonable data set (< 1B rows).
Hadoop: Inexpensive storage of lots of data, structured and semi-structured.

Weakness
RDBMS: Massive data volumes, unstructured and semi-structured data.
Hadoop: Complex, code-based, incompatible approaches in the market; writes (one at a time).

Scalability
RDBMS: Challenging to scale out.
Hadoop: Strong bias to the open source community and Java.

Size of data
RDBMS: Gigabytes.
Hadoop: Petabytes.

Integrity of data
RDBMS: High (referential, typed).
Hadoop: Low.

Data Schema
RDBMS: Static.
Hadoop: Dynamic.

Access Method
RDBMS: Interactive and batch.
Hadoop: Batch.

Scaling
RDBMS: Nonlinear (worse than linear).
Hadoop: Linear.

Data structure
RDBMS: Structured.
Hadoop: Unstructured.

Normalization of data
RDBMS: Required.
Hadoop: Not required.

Query Response Time
RDBMS: Can be near immediate.
Hadoop: Has latency (due to batch processing).

Q10. Explain Hadoop ECO system.

The Hadoop ecosystem refers to the various components of the Apache Hadoop
software library, as well as to the accessories and tools provided by the Apache Software
Foundation for these types of software projects, and to the ways that they work together.
Hadoop is a Java-based framework that is extremely popular for handling and analyzing large
sets of data.

1. MapReduce:-
MapReduce is now the most widely used general purpose computing model and
runtime system for distributed data analytics. MapReduce is based on the parallel
programming framework to process the large amounts of data dispersed across different
systems. The process is initiated when a user request is received to execute the MapReduce
program and terminated once the results are written back to HDFS. MapReduce enables the
computational processing of data stored in a file system without the requirement of loading
the data initially into a database

2. Pig:-
Pig is a platform for constructing data flows for Extract, Transform, & Load
processing and analysis of large data sets. Pig Latin, the programming language for pig
provides common data manipulation operations such as grouping, joining, & filtering. Pig
generates Hadoop MapReduce jobs to perform the data flows. The Pig Latin scripting language is
not only a higher-level data flow language but also has operators similar to SQL (e.g.
FILTER, JOIN).
3. Hive:-
Hive is a SQL based data warehouse system for Hadoop that facilitates data
summarization, adhoc queries, and the analysis of large data sets stored in Hadoop
Compatible file system (Ex:- HDFS) and some NOSQL databases.
Hive is not a relational database but a query engine that supports the parts of SQL specific to
querying data, with some additional support for writing new tables or files, but not updating
individual records.
4. HDFS:-
HDFS is an effective, scalable, fault tolerant, and distributed approach for storing and
managing huge volumes of data. HDFS works on write once read many times approach and
this makes it capable of handling such huge volumes of data with the least possibilities of
errors caused by the replication of data.
5. Hadoop YARN:-
YARN is a core Hadoop Service that supports two major services:-
1. Resource Manager
2. Application Master
The Resource Manager is a master service that manages the Node Manager in each
node of the cluster. It also has a scheduler that allocates system resources to specific running
applications. The scheduler, however, does not track the application's status. The resource
container stores all the needed system information and maintains detailed resource attributes
that are important for running applications on the node and in the cluster. The Application
Master notifies the Node Manager if more resources are required for executing the
application.
6. HBASE:-
HBASE is one of the projects of APACHE Software Foundation that is distributed
under Apache Software License V2.0. It is a non-relational database suitable for distributed
Environment and uses HDFS as its persistence storage. HBASE facilitates reading/writing of
Big data randomly and efficiently in real time. It is highly configurable, allows efficient
management of huge amount of data, and helps in dealing with Big Data challenges in many
ways.

7. Sqoop:-
Sqoop is a tool for data transfer between hadoop and relational databases. Critical
Processes are employed by MapReduce to move data into Hadoop and back to other data
sources. Sqoop is a command line interpreter which sequentially executes sqoop commands.
Sqoop operates by selecting an appropriate import function for source data from the specified
database.
8. Zookeeper:-
Zookeeper helps in coordinating all the elements of distributed applications.
Zookeeper enables different nodes of a service to communicate and coordinate with each
other and also find other master IP addresses. Zookeeper provides a central location for
keeping information, thus acting as a coordinator that makes the stored information available
to all nodes of a service
9. Flume:-
Flume aids in transferring large amounts of data from distributed resources to a single
centralized repository. It is robust and fault tolerant, and it efficiently collects, aggregates, and
transfers data. Flume is used for real-time data capture in Hadoop. The simple and
extensible data model of Flume facilitates fast online data analytics.
10. Oozie :-
Oozie is an open source Apache Hadoop service used to manage and process
submitted jobs. It supports the workflow/coordination model and is highly extensible and
scalable. Oozie is a service that coordinates dependencies among different jobs
executing on different Hadoop platforms such as HDFS, Pig, and MapReduce.
11. Mahout:-
Mahout is a scalable machine learning and data mining library. There are currently four
main groups of algorithms in Mahout:-
1. Recommendations, or collaborative filtering
2. Classification, categorization
3. Clustering
4. Frequent item set mining, parallel frequent pattern mining.

Q11. Explain the storing & querying (reading & writing) the Big Data in
HDFS.

HADOOP DISTRIBUTED FILE SYSTEM:


Hadoop File System was developed using distributed file system design.
It is suitable for the distributed storage and processing.
Hadoop provides a command interface to interact with HDFS.
The built-in servers of namenode and datanode help users to easily check the status of
cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.

NAMENODE
The namenode is the commodity hardware that contains the GNU/Linux operating system
and the namenode software. It is a software that can be run on commodity hardware.
The system having the namenode acts as the master server and it does the following tasks:
Manages the file system namespace.
Regulates clients access to files.
It also executes file system operations such as renaming, closing, and opening files and
directories.
DATANODE
The datanode is a commodity hardware having the GNU/Linux operating system and
datanode software. For every node (Commodity hardware/System) in a cluster, there will
be a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according
to the instructions of the namenode.

BLOCK
Generally the user data is stored in the files of HDFS.
The file in a file system will be divided into one or more segments and/or stored in
individual data nodes.

These file segments are called blocks. In other words, the minimum amount of data
that HDFS can read or write is called a block.
The default block size is 64 MB, but it can be increased as per the need by changing the
HDFS configuration.
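For illustration only (not part of the original notes), the following is a minimal Java sketch of writing and then reading a file through the HDFS client API; the path shown is hypothetical, and the NameNode address is assumed to come from core-site.xml on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS (the NameNode address) is read from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/hadoop/demo/hello.txt"); // hypothetical path

        // Write: the client asks the NameNode where to place blocks, then streams bytes to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) { // true = overwrite if it already exists
            out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: block data is fetched from the DataNodes holding the replicas.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        fs.close();
    }
}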

Q12. Draw & explain the Hadoop architecture

Hadoop follows a master slave architecture design for data storage and distributed
data processing using HDFS and MapReduce respectively.

The master node for data storage is hadoop HDFS is the NameNode and the master
node for parallel processing of data using Hadoop MapReduce is the Job Tracker.

The slave nodes in the hadoop architecture are the other machines in the Hadoop
cluster which store data and perform complex computations.
Every slave node has a Task Tracker daemon and a DataNode that synchronizes the
processes with the Job Tracker and NameNode respectively.

HDFS in Hadoop Application Architecture Implementation

A file on HDFS is split into multiple blocks and each block is replicated within the Hadoop
cluster. Hadoop Distributed File System (HDFS) stores the application data and file
system metadata separately on dedicated servers.

It has two components: NameNode and DataNode
NameNode
File system metadata is stored on servers referred to as NameNode. All the files and
directories in the HDFS are represented on the NameNode. NameNode maps the entire
file system structure into memory.
DataNode
Application data is stored on servers referred to as DataNodes. HDFS replicates the
file content on multiple DataNodes. A DataNode manages the state of an HDFS node and
interacts with its blocks. A DataNode can perform CPU-intensive jobs and I/O-intensive
jobs like clustering, data import, data export, search, decompression, and indexing.

MapReduce in Hadoop Application Architecture Implementation


Map function transforms the piece of data into key-value pairs and then the keys are
sorted where a reduce function is applied to merge the values based on the key into a
single output.
The execution of a MapReduce job begins when the client submits the job
configuration to the Job Tracker that specifies the map, combine and reduce functions
along with the location for input and output data.

On receiving the job configuration, the Job Tracker identifies the number of splits
based on the input path and selects Task Trackers based on their network vicinity to
the data sources. The Job Tracker then sends a request to the selected Task Trackers.
The processing of the Map phase begins where the Task Tracker extracts the input
data from the splits. Map function is invoked for each record. On completion of the
map task, Task Tracker notifies the Job Tracker.
When all Task Trackers are done, the Job Tracker notifies the selected Task Trackers
to begin the reduce phase. Task Tracker reads the region files and sorts the key-value
pairs for each key. The reduce function is then invoked which collects the aggregated
values into the output file.
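To make the map and reduce phases described above concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce Java API (the class name and the input/output paths passed on the command line are illustrative, not from the original notes):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs on the node holding the input split and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all values for the same key arrive together and are summed into one count.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");          // job configuration submitted to the cluster
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);               // optional map-side combine step
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}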

Q13. Explain the different HDFS commands.



Sr No. Name Description
1. Cat Usage: hdfs dfs -cat URI [URI ...]
Copies source paths to stdout.
Example:
hdfs dfs -cat hdfs://nn1.example.com/file1
hdfs://nn2.example.com/file2
hdfs dfs -cat file:///file3 /user/hadoop/file4
2. chgrp Usage: hdfs dfs -chgrp [-R] GROUP URI [URI ...]
Change group association of files. The user must be the owner of
files, or else a super-user. Additional information is in
the Permissions Guide.
Options
The -R option will make the change recursively through the
directory structure.
3. chmod Usage:
hdfs dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
Change the permissions of files. With -R, make the change
recursively through the directory structure. The user must be the
owner of the file, or else a super-user. Additional information is in
the Permissions Guide.

Options
The -R option will make the change recursively through the
directory structure.
4. Chown Usage: hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI ]
Change the owner of files. The user must be a super-user. Additional
information is in the Permissions Guide.
Options
The -R option will make the change recursively through the
directory structure.
5. copyFromLocal Usage: hdfs dfs -copyFromLocal <localsrc> URI
Similar to put command, except that the source is restricted to a local
file reference.
Options:
The -f option will overwrite the destination if it already
exists.
6. copyToLocal Usage: hdfs dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Similar to get command, except that the destination is restricted to a
local file reference.
hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
/user/hadoop/dir
Exit Code:
Returns 0 on success and -1 on error.
7. moveToLocal Usage: hdfs dfs -moveToLocal [-crc] <src> <dst>
Displays a "Not implemented yet" message.

Q14. Write a note on Hadoop advantages & challenges.

Hadoop has also proven valuable for many other more traditional enterprises based on some
of its big advantages:
1. Scalable
Hadoop is a highly scalable storage platform, because it can store and distribute very
large data sets across hundreds of inexpensive servers that operate in parallel. Unlike
traditional relational database systems (RDBMS) that can't scale to process large amounts of
data, Hadoop enables businesses to run applications on thousands of nodes involving
thousands of terabytes of data.
2. Cost effective
Hadoop also offers a cost-effective storage solution for businesses' exploding data
sets. The problem with traditional relational database management systems is that it is
extremely cost prohibitive to scale to such a degree in order to process such massive volumes
of data. In an effort to reduce costs, many companies in the past would have had to down-
sample data and classify it based on certain assumptions as to which data was the most
valuable. The cost savings are staggering: instead of costing thousands to tens of thousands of
pounds per terabyte, Hadoop offers computing and storage capabilities for hundreds of
pounds per terabyte.
3. Flexible
Hadoop enables businesses to easily access new data sources and tap into different
types of data (both structured and unstructured) to generate value from that data. This means
businesses can use Hadoop to derive valuable business insights from data sources such as
social media, email conversations or clickstream data.
4. Fast
Hadoop's unique storage method is based on a distributed file system that basically maps
data wherever it is located on a cluster. The tools for data processing are often on the same
servers where the data is located, resulting in much faster data processing. If you're dealing
with large volumes of unstructured data, Hadoop is able to efficiently process terabytes of
data in just minutes, and petabytes in hours.
5. Advanced data analysis can be done in house
Hadoop makes it practical to work with large data sets and customize the outcome without
having to outsource the task to specialist service providers. Keeping operations in house helps
organizations be more agile, while also avoiding the ongoing operational expense of
outsourcing.
6. Run a commodity vs. custom architecture
Some of the tasks that Hadoop is being used for today were formerly run by MPCC and other
specialty, expensive computer systems. Hadoop commonly runs on commodity hardware.
Because it is the de facto big data standard, it is supported by a large and competitive solution
provider community, which protects customers from vendor lock-in.

Challenges of Hadoop:
1. Hadoop is a cutting edge technology
Hadoop is a new technology, and as with adopting any new technology, finding people who
know the technology is difficult!
2. Hadoop in the Enterprise Ecosystem
Hadoop is designed to solve Big Data problems encountered by Web and Social companies.
In doing so a lot of the features Enterprises need or want are put on the back burner. For
example, HDFS does not offer native support for security and authentication.
3. Hadoop is still rough around the edges
The development and admin tools for Hadoop are still pretty new. Companies like Cloudera,
Hortonworks, MapR and Karmasphere have been working on this issue. However the tooling
may not be as mature as Enterprises are used to (as say, Oracle Admin, etc.)
4. Hadoop is NOT cheap
Hardware Cost
Hadoop runs on 'commodity' hardware. But these are not cheapo machines, they are server
grade hardware.
So standing up a reasonably large Hadoop cluster, say 100 nodes, will cost a significant
amount of money.
IT and Operations costs
A large Hadoop cluster will require support from various teams like : Network Admins, IT,
Security Admins, System Admins.
Also one needs to think about operational costs like Data Center expenses: cooling,
electricity, etc.
5. Map Reduce is a different programming paradigm
Solving problems using Map Reduce requires a different kind of thinking. Engineering teams
generally need additional training to take advantage of Hadoop.

Q15. Explain different features of HBase.

1. HBase is not an eventually consistent DataStore, which makes it
very suitable for tasks such as high-speed counter aggregation (strongly consistent
reads/writes).
2. HBase tables are distributed on the cluster via regions, and regions are automatically split
and re-distributed as your data grows(Automatic sharding)
3. Automatic RegionServer failover
4. HBase supports out of the box as its distributed file system(Hadoop/HDFS Integration)
5. HBase supports massively parallelized processing via MapReduce, using HBase as both a
source and a sink (MapReduce).
6. HBase supports an easy-to-use Java API for programmatic access (Java Client API); a
brief sketch follows this list.
7. HBase also supports Thrift and REST for non-Java front-ends(Thrift/REST API)
8. HBase supports a Block Cache and Bloom Filters for high volume query
optimization(Block Cache and Bloom Filters)
9. HBase provides build-in web-pages for operational insight as well as JMX
metrics(Operational Management)
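As referenced in feature 6, the following is a minimal, hedged sketch of the Java Client API, assuming a hypothetical table named "customer" with a column family "info" already exists on the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        // hbase-site.xml on the classpath tells the client where ZooKeeper and the cluster are.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("customer"))) { // hypothetical table

            // Put: write a cell addressed by (row key, column family, column qualifier).
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            table.put(put);

            // Get: random read of a single row by its row key.
            Get get = new Get(Bytes.toBytes("row-001"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

            // Scan: iterate over a range of rows (here, the whole table).
            try (ResultScanner scanner = table.getScanner(new Scan())) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}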

Use of HBase :
1. HBase should be used only when we have millions or billions of rows and columns in a
table; otherwise it is better to go with an RDBMS (which works well with thousands of
rows and columns).
2. An RDBMS runs on a single database server, but HBase is distributed and scalable and
also runs on commodity hardware.
3. Typed columns, secondary indexes, transactions, advanced query languages, etc. are
features provided by an RDBMS, not by HBase, so make sure the application can live
without them.

Q16. Write down the difference between HDFS and HBase

1. HDFS: HDFS is a distributed file system that provides redundant storage space for storing files that are very huge in size.
   HBase: HBase, on the other hand, is a database that stores its data in a distributed file system.
2. HDFS: HDFS provides a fast file read and write mechanism, as data is stored in different nodes in a cluster.
   HBase: HBase is an open source, distributed, versioned, column-oriented, NoSQL / non-relational database management system that runs on top of Hadoop.
3. HDFS: It is optimized for streaming access of large files.
   HBase: It provides low-latency access to small amounts of data from within a large data set.
4. HDFS: We can typically store files that are in the 100s of MB upwards on HDFS and access them through MapReduce to process them in batch mode.
   HBase: You can access single rows quickly from a billion-row table. It offers a flexible data model to work with, and data is indexed by the row key.
5. HDFS: HDFS files are write-once files.
   HBase: It supports fast scans across tables.
6. HDFS: HDFS doesn't do random reads very well.
   HBase: It scales in terms of writes as well as total volume of data.

Q17. Write down the difference between RDBMS and HBase

1. RDBMS: An RDBMS is governed by its schema; it follows a fixed schema that describes the whole structure of the tables.
   HBase: HBase is schema-less; it doesn't follow the concept of a fixed schema, hence it has flexible schemas.
2. RDBMS: It is mostly row oriented.
   HBase: It is column oriented; you define only column families.
3. RDBMS: It doesn't natively scale to distributed storage.
   HBase: It is a distributed, versioned data storage system.
4. RDBMS: Since it has a fixed schema, it doesn't support addition of columns.
   HBase: It supports dynamic addition of columns to the table schema.
5. RDBMS: It is built for narrow tables.
   HBase: It is built for wide tables.
6. RDBMS: It scales vertically but is hard to scale.
   HBase: HBase is horizontally scaled.
7. RDBMS: It is not optimized for sparse tables.
   HBase: It is good with sparse tables.
8. RDBMS: It has SQL as its query language.
   HBase: It has no query language, only three commands: put, get and scan.
9. RDBMS: It supports secondary indexes and improves data retrieval through the SQL language.
   HBase: HDFS is the underlying layer of HBase and provides fault tolerance and linear scalability. HBase doesn't support secondary indexes and stores data as key-value pairs.
10. RDBMS: An RDBMS is transactional.
    HBase: There are no transactions in HBase.
11. RDBMS: It has normalized data.
    HBase: It has de-normalized data.
12. RDBMS: It is good for structured data.
    HBase: It is good for semi-structured as well as structured data.
13. RDBMS: Maximum data size is in TBs.
    HBase: Maximum data size is in hundreds of PBs.
14. RDBMS: Read/write throughput limits are thousands of queries/second.
    HBase: Read/write throughput limits are millions of queries/second.
15. RDBMS: RDBMS database technology is very proven, consistent, mature and highly supported by the world's best companies; hence it is more appropriate for real-time OLTP processing.
    HBase: HBase helps Hadoop overcome the challenges in random reads and writes.

Q18. Explain the storage mechanism of HBase

HBase is a column-oriented database and the tables in it are sorted by row. The table
schema defines only column families, which are the key-value pairs. A table can have multiple
column families and each column family can have any number of columns. Subsequent
column values are stored contiguously on the disk. Each cell value of the table has a
timestamp.
Example schema of table in HBase:

Column Oriented and Row Oriented:


Column-oriented databases are those that store data tables as sections of columns of
data, rather than as rows of data. They will have column families.

In HBase, data is stored in tables, which have rows and columns.


Table:
An HBase table consists of multiple rows.
Row:
A row in HBase consists of a row key and one or more columns with values
associated with them. Rows are sorted alphabetically by the row key as they are stored. The
goal is to store data in such a way that related rows are near each other.

Column:
A column in HBase consists of a column family and a column qualifier, which are
delimited by a colon (:) character.
Column Family:
Column families physically co-locate a set of columns and their values, often for
performance reasons. Each column family has a set of storage properties, such as whether its
values should be cached in memory, how its data is compressed or its row keys are encoded,
and others. Each row in a table has the same column families, though a given row might not
store anything in a given column family.
Following image shows column families in a column-oriented database:

Timestamp:
A timestamp is written alongside each value, and is the identifier for a given version
of a value. By default, the timestamp represents the time on the RegionServer when the
data was written, but you can specify a different timestamp value when you put data into
the cell.
Conceptual View:
At a conceptual level tables may be viewed as a sparse set of rows, they are physically
stored by column family.
Cells in this table that appear to be empty do not take space, or in fact exist, in HBase. This is
what makes HBase "sparse."
Row Key "com.cnn.www", Time Stamp t9:  anchor:cnnsi.com = "CNN"
Row Key "com.cnn.www", Time Stamp t8:  anchor:my.look.ca = "CNN.com"
Row Key "com.cnn.www", Time Stamp t6:  contents:html = "<html>"
(The columns belong to the column families "contents" and "anchor".)

Physical View:
A new column qualifier (column_family:column_qualifier) can be added to an
existing column family at any time.

Row Key Time Stamp ColumnFamily anchor


"com.cnn.www" t9 anchor:cnnsi.com = "CNN"
"com.cnn.www" t8 anchor:my.look.ca = "CNN.com"

The empty cells shown in the conceptual view are not stored at all. Thus a request for the
value of the contents:html column at time stamp t8 would return no value.
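As a hedged sketch of how the versioned cells above could be written and read through the Java client: the table name "webtable" and the families "anchor" and "contents" are assumed to exist, and the explicit timestamps 6, 8 and 9 stand in for t6, t8 and t9 from the conceptual view (in practice the timestamp normally defaults to the RegionServer time, as described earlier).

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedCellsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("webtable"))) { // hypothetical table
            byte[] row = Bytes.toBytes("com.cnn.www");
            byte[] anchor = Bytes.toBytes("anchor");
            byte[] contents = Bytes.toBytes("contents");

            // Each cell is addressed by (row key, column family, column qualifier, timestamp).
            Put put = new Put(row);
            put.addColumn(contents, Bytes.toBytes("html"), 6L, Bytes.toBytes("<html>"));
            put.addColumn(anchor, Bytes.toBytes("my.look.ca"), 8L, Bytes.toBytes("CNN.com"));
            put.addColumn(anchor, Bytes.toBytes("cnnsi.com"), 9L, Bytes.toBytes("CNN"));
            table.put(put);

            // Read the row back and print every stored cell together with its timestamp.
            Result result = table.get(new Get(row));
            List<Cell> cells = result.listCells();
            for (Cell cell : cells) {
                System.out.println(Bytes.toString(CellUtil.cloneFamily(cell)) + ":"
                        + Bytes.toString(CellUtil.cloneQualifier(cell)) + " @ "
                        + cell.getTimestamp() + " = "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}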

Q19. Draw and explain in detail HBase architecture.

HBase architecture consists mainly of four components:
1. HMaster
2. HRegionServer
3. HRegion
4. Zookeeper

HMaster
HMaster is the implementation of the Master server in the HBase architecture. It acts as a
monitoring agent that monitors all Region Server instances present in the cluster and acts as
an interface for all metadata changes.
In a distributed cluster environment, Master runs on NameNode. Master runs several
background threads.
A master is responsible for:
Coordinating the region servers
Assigning regions on start-up , re-assigning regions for recovery or load balancing
Monitoring all RegionServer instances in the cluster
Interface for creating, deleting, updating tables

HRegionServer
HRegionServer is the Region Server implementation. It is responsible for serving and
managing the regions or data present in a distributed cluster.
The region servers run on the Data Nodes present in the Hadoop cluster.
HRegion servers perform the following functions:
Hosting and managing regions
Splitting regions automatically
Handling read and writes requests
Communicating with the client directly

HRegion
HRegions are the basic building elements of an HBase cluster; they consist of the distribution
of tables and are comprised of column families.
An HRegion contains multiple stores, one for each column family. It consists of mainly two
components, which are the Memstore and HFile.

1. Memstore
When something is written to HBase, it is first written to an in-memory store
(memstore), once this memstore reaches a certain size, it is flushed to disk into a store file
(and is also written immediately to a log file for durability). The store files created on disk are
immutable.

2. HFile
An HFile contains a multi-layered index which allows HBase to seek the data without
having to read the whole file. The data is stored in HFile in key-value pair in increasing
order.

ZooKeeper
HBase uses ZooKeeper as a distributed coordination service to maintain server state in
the cluster.
Zookeeper maintains which servers are alive and available, and provides server failure
notification. Zookeeper uses consensus to guarantee common shared state.
Services provided by ZooKeeper are as follows :
Maintains Configuration information
Provides distributed synchronization
Client Communication establishment with region servers
Master servers use ephemeral nodes for discovering available servers in the
cluster
To track server failure and network partitions
HLog
HLog is the HBase Write Ahead Log (WAL) implementation and there is one HLog
instance per RegionServer.
Each RegionServer adds updates (Puts, Deletes) to its write-ahead log first and then to
the Memstore.

Q20. What is Pig? Explain its advantages/ benefits.

Pig is a high level scripting language that is used with Apache Hadoop. Pig enables
data workers to write complex data transformations without knowing Java.
Pig's simple SQL-like scripting language is called Pig Latin, and it appeals to developers
already familiar with scripting languages and SQL.
Pig is complete, so you can do all required data manipulations in Apache Hadoop with Pig.
Through the User Defined Function (UDF) facility in Pig, Pig can invoke code in many
languages like JRuby, Jython and Java. You can also embed Pig scripts in other languages.
The result is that you can use Pig as a component to build larger and more complex
applications that tackle real business problems.
Pig works with data from many sources, including structured and unstructured data, and stores
the results into the Hadoop Distributed File System. Pig scripts are translated into a series of
MapReduce jobs that are run on the Apache Hadoop cluster.
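As a small, hedged sketch of embedding Pig Latin in another language (one of the options mentioned above), the snippet below drives Pig from Java through the org.apache.pig.PigServer API; the log file name and its schema are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddingSketch {
    public static void main(String[] args) throws Exception {
        // ExecType.LOCAL runs against the local file system; ExecType.MAPREDUCE would
        // compile the same script into MapReduce jobs and submit them to a Hadoop cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each registerQuery call adds one Pig Latin statement to the logical plan.
        pig.registerQuery("logs = LOAD 'access_log.txt' USING PigStorage(' ') "
                + "AS (ip:chararray, url:chararray, bytes:long);");
        pig.registerQuery("by_ip = GROUP logs BY ip;");
        pig.registerQuery("traffic = FOREACH by_ip GENERATE group AS ip, SUM(logs.bytes) AS total;");

        // Nothing executes until an output is requested (Pig is lazily evaluated);
        // store() triggers compilation of the whole plan and runs it.
        pig.store("traffic", "traffic_by_ip");
        pig.shutdown();
    }
}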

Advantages of PIG
1. Decrease in development time. This is the biggest advantage especially considering
vanilla map-reduce jobs' complexity, time-spent and maintenance of the programs.
2. Learning curve is not steep, anyone who does not know how to write vanilla map-
reduce or SQL for that matter could pick up and can write map-reduce jobs; not easy
to master, though.
3. Procedural, not declarative (unlike SQL), so it is easier to follow the commands and it
provides better expressiveness in the transformation of data at every step. Compared to
vanilla map-reduce, it reads much more like an English language; it is concise and,
unlike Java, more like Python.
4. Since it is procedural, you can control the execution of every step. If you want to
write your own UDF (User Defined Function) and inject it at one specific point in the
pipeline, it is straightforward (a small UDF sketch follows this list).
5. Speaking of UDFs, you can also write your UDFs in Python.
6. Lazy evaluation: until you produce an output file or output a message, nothing
gets evaluated. This is an advantage for the logical plan: the optimizer can consider
the program from beginning to end and produce an efficient plan to
execute.

7. Enjoys everything that Hadoop offers, such as parallelization and fault tolerance, together
with many relational database features.
8. It is quite effective for unstructured and messy large datasets; in fact, Pig is one of the
best tools for turning large unstructured data into structured data.
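As a rough illustration of advantage 5 above, the sketch below registers a hypothetical Jython
UDF and calls it from Pig Latin. The file name clean_udfs.py, the function normalize and the
input path are assumptions made for this example, not something taken from these notes; the
file would define normalize(word) and annotate it with @outputSchema('word:chararray').

-- Register a Python (Jython) UDF file under the namespace 'cleaners' (hypothetical names).
REGISTER 'clean_udfs.py' USING jython AS cleaners;

-- Load one word per line and normalize it with the UDF.
raw_words   = LOAD 'input/words.txt' AS (word:chararray);
clean_words = FOREACH raw_words GENERATE cleaners.normalize(word) AS word;
DUMP clean_words;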

Q21. Write a note on Pig (like what is pig, its components, its different user
& for what it is used)

Apache Pig is a high-level data flow platform for executing MapReduce programs on
Hadoop.
The language for Pig is Pig Latin. Pig scripts are internally converted into MapReduce jobs
and executed on data stored in HDFS.
Every task that can be achieved using Pig can also be achieved using Java code written with
MapReduce.

Apache Pig Components :
There are various components in the Apache Pig framework; let us take a look at the
major ones.
1. Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script,
does type checking, and other miscellaneous checks. The output of the parser will be a DAG
(directed acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the data flows
are represented as edges.
2. Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out logical
optimizations such as projection pushdown.
3. Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
4. Execution Engine
Finally, the MapReduce jobs are submitted to Hadoop in a sorted order and executed on the
Hadoop cluster, producing the desired results. (A hedged EXPLAIN sketch showing these
plans appears at the end of this answer.)
Uses:
Apache Pig is generally used by data scientists for performing tasks involving ad-hoc
processing and quick prototyping. Apache Pig is used:
To process huge data sources such as web logs.
To perform data processing for search platforms.
To process time-sensitive data loads.
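The parser, optimizer and compiler output mentioned above can be inspected from the Grunt
shell with Pig's EXPLAIN operator. A minimal sketch follows; the file name, delimiter and
schema are assumptions invented for this illustration.

-- Build a small pipeline, then ask Pig to print its plans (hypothetical input and schema).
grunt> logs    = LOAD 'web_logs.txt' USING PigStorage('\t') AS (url:chararray, hits:int);
grunt> popular = FILTER logs BY hits > 100;
grunt> ranked  = ORDER popular BY hits DESC;
grunt> EXPLAIN ranked;   -- prints the logical, physical and MapReduce plans for 'ranked'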

Q22. Explain Pig Latin and its key properties?

The Pig programming language is designed to handle any kind of data. Pig is made up
of two components: the first is the language itself, which is called Pig Latin, and the second
is a runtime environment where Pig Latin programs are executed.
Pig Latin statements are the basic constructs you use to process data using Pig. A Pig Latin
statement is an operator that takes a relation as input and produces another relation as output.
(This definition applies to all Pig Latin operators except LOAD and STORE, which read data
from and write data to the file system.)
Pig Latin statements may include expressions and schemas.
Pig Latin statements can span multiple lines and must end with a semicolon ( ; ).
By default, Pig Latin statements are processed using multi-query execution.
Pig Latin statements are generally organized as follows:
-A LOAD statement to read data from the file system.
-A series of "transformation" statements to process the data.
-A DUMP statement to view results or a STORE statement to save the results.
Note that a DUMP or STORE statement is required to generate output.
Example :
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
Output:
(John)
(Mary)
(Bill)
(Joe)

Pig's language layer currently consists of a textual language called Pig Latin, which has
the following key properties:
1. Ease of programming.
It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data
analysis tasks.
Complex tasks comprised of multiple interrelated data transformations are explicitly encoded
as data flow sequences, making them easy to write, understand, and maintain.
2. Optimization opportunities.
The way in which tasks are encoded permits the system to optimize their execution
automatically, allowing the user to focus on semantics rather than efficiency.
3. Extensibility.
Users can create their own functions to do special-purpose processing.

Q23. Explain relational operators in PigLatin with syntax, example.

Apache Pig, originally developed at Yahoo!, helps in analysing large datasets while spending
less time on writing mapper and reducer programs.
Pig enables users to write complex data analysis code without prior knowledge of Java. Pig's
simple SQL-like scripting language is called Pig Latin, and it has its own Pig runtime
environment where Pig Latin programs are executed.
The most commonly used relational operators are described below; a combined example that
chains several of them together is given at the end of this answer.
Relational Operators:
1. FOREACH
Generates data transformations based on columns of data.
Syntax
alias = FOREACH { gen_blk | nested_gen_blk } [AS schema];
Example:
A = load 'input' as (user:chararray, id:long, address:chararray, phone:chararray,
preferences:map[]);
B = foreach A generate user, id;

2. LOAD:
LOAD operator is used to load data from the file system or HDFS storage into a Pig
relation.
Syntax :
LOAD 'data' [USING function] [AS schema];
Example:
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING
PigStorage(',') AS ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
The resulting relation has the schema:
student: { id: int, firstname: chararray, lastname: chararray, phone: chararray, city: chararray }

3. FILTER:
This operator selects tuples from a relation based on a condition.
Syntax
grunt> Relation2_name = FILTER Relation1_name BY (condition);

Example:
X = FILTER A BY f3 ==3;

4. JOIN:
JOIN operator is used to perform an inner, equijoin of two or more relations based on
common field values. In this form, the JOIN operator performs an inner join. Inner joins
ignore null keys, so it makes sense to filter them out before the join.
Syntax
alias = JOIN alias BY {expression|'('expression [, expression ]')'} (, alias BY
{expression|'('expression [, expression ]')'} ) [USING 'replicated' | 'skewed' | 'merge']
[PARALLEL n];
Example:
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

5. ORDER BY:
Order By is used to sort a relation based on one or more fields. You can do sorting in
ascending or descending order using ASC and DESC keywords.
Syntax
alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias
[ASC|DESC] ] } [PARALLEL n];
Example:
grunt> order_by_data = ORDER student_details BY age DESC;

6. DISTINCT:
Distinct removes duplicate tuples in a relation.
Syntax
grunt> Relation_name2 = DISTINCT Relation_name1;
Example:
grunt> distinct_data = DISTINCT student_details;

7. STORE:
Store is used to save results to the file system.
Syntax:
STORE Relation_name INTO 'required_directory_path' [USING function];
Example:
grunt> STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');

8. GROUP:
The GROUP operator groups together the tuples with the same group key (key field).
The key field will be a tuple if the group key has more than one field, otherwise it will be the
same type as that of the group key. The result of a GROUP operation is a relation that
includes one tuple per group.
Syntax
alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression ] [USING
'collected'] [PARALLEL n];
Example:
grunt> group_data = GROUP student_details by age;

9. LIMIT:
LIMIT operator is used to limit the number of output tuples.
Syntax
grunt> Result = LIMIT Relation_name number_of_tuples;
Example:
grunt> limit_data = LIMIT student_details 4;

10. SPLIT:
SPLIT operator is used to partition the contents of a relation into two or more relations,
depending on the conditions stated in the given expressions.
Syntax
grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF
(condition2);
Example:
SPLIT student_details INTO student_details1 IF age < 23, student_details2 IF (age >= 23
AND age <= 25);
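To show how these operators combine in practice, here is a hedged end-to-end script; the input
path, delimiter and field names are assumptions made for this illustration and are not part of the
original notes.

-- Load a hypothetical student file, keep adults, count them per city,
-- rank the cities and store the top five.
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt'
                  USING PigStorage(',')
                  AS (id:int, firstname:chararray, city:chararray, age:int);
adults          = FILTER student_details BY age >= 18;
by_city         = GROUP adults BY city;
city_counts     = FOREACH by_city GENERATE group AS city, COUNT(adults) AS total;
sorted_counts   = ORDER city_counts BY total DESC;
top_cities      = LIMIT sorted_counts 5;
STORE top_cities INTO 'hdfs://localhost:9000/pig_output/top_cities' USING PigStorage(',');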

Q24. Explain the data types such as tuple, bag, relation and map in
PigLatin

The data model of Pig is fully nested. A relation is the outermost structure of the Pig
Latin data model, and it is a bag, where:
A bag is a collection of tuples.
A tuple is an ordered set of fields.
A field is a piece of data.

Pig Latin Data Types
Simple Types
1. int : Represents a signed 32-bit integer. Example: 8
2. long : Represents a signed 64-bit integer. Example: 5L
3. float : Represents a signed 32-bit floating point. Example: 5.5F
4. double : Represents a 64-bit floating point. Example: 10.5
5. chararray : Represents a character array (string) in Unicode UTF-8 format. Example: tutorials point
6. bytearray : Represents a byte array (blob).
7. boolean : Represents a Boolean value. Example: true / false
8. datetime : Represents a date-time. Example: 1970-01-01T00:00:00.000+00:00
9. biginteger : Represents a Java BigInteger. Example: 60708090709
10. bigdecimal : Represents a Java BigDecimal. Example: 185.98376256272893883
Complex Types
11. tuple : An ordered set of fields. Example: (raja, 30)
12. bag : A collection of tuples. Example: {(raju,30),(Mohhammad,45)}
13. map : A set of key-value pairs. Example: [name#Raju, age#30]

Null Values
Values for all the above data types can be NULL. Apache Pig treats null values in a
similar way as SQL does.
A null can be an unknown value or a non-existent value. It is used as a placeholder for
optional values. These nulls can occur naturally or can be the result of an operation.

Type Construction Operators
() : Tuple constructor operator. This operator is used to construct a tuple. Example: (Raju, 30)
{} : Bag constructor operator. This operator is used to construct a bag. Example: {(Raju, 30), (Mohammad, 45)}
[] : Map constructor operator. This operator is used to construct a map. Example: [name#Raja, age#30]
Relational Operations
Loading and Storing
LOAD : To load the data from the file system (local/HDFS) into a relation.
STORE : To save a relation to the file system (local/HDFS).
Filtering
FILTER : To remove unwanted rows from a relation.
DISTINCT : To remove duplicate rows from a relation.
FOREACH, GENERATE : To generate data transformations based on columns of data.
STREAM : To transform a relation using an external program.
Grouping and Joining
JOIN : To join two or more relations.
COGROUP : To group the data in two or more relations.
GROUP : To group the data in a single relation.
CROSS : To create the cross product of two or more relations.
Sorting
ORDER : To arrange a relation in a sorted order based on one or more fields (ascending or descending).
LIMIT : To get a limited number of tuples from a relation.
Combining and Splitting
UNION : To combine two or more relations into a single relation.
SPLIT : To split a single relation into two or more relations.
Diagnostic Operators
DUMP : To print the contents of a relation on the console.
DESCRIBE : To describe the schema of a relation.
EXPLAIN : To view the logical, physical, or MapReduce execution plans used to compute a relation.
ILLUSTRATE : To view the step-by-step execution of a series of statements.

Q25. Explain the storage mechanism of Hive. (Table, partition, bucket)

Hive is an open source volunteer project under the Apache Software Foundation. Hive
provides a mechanism to project structure onto data stored in Hadoop and to query it using a
Structured Query Language (SQL)-like syntax called HiveQL (Hive Query Language).
This language also lets traditional map/reduce programmers plug in their custom mappers
and reducers when it is inconvenient or inefficient to express the logic in HiveQL.
It supports SQL-like access to structured data, as well as Big Data analysis with the help
of MapReduce.
HiveQL statements are broken down by the Hive service into MapReduce jobs and executed
across a Hadoop cluster.
Hive's primary responsibility is to provide data summarization, query and analysis.

The data is organized and stored in three different units in Hive:

Tables:
Similar to RDBMS tables, they contain rows and columns. Hive is just layered over the
Hadoop File System (HDFS), hence tables are directly mapped to directories of
the file system. It also supports tables stored in other native file systems.

Partitions:
Hive tables can have more than one partition. Partitions are mapped to subdirectories of the
file system as well.

Buckets:
In Hive, data may further be divided into buckets. Buckets are stored as files inside the
partition directories in the underlying file system.
EX : Create table and load data :

CREATE TABLE IF NOT EXISTS employee ( eid int, name String, salary String,
destination String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/user/sample.txt'
OVERWRITE INTO TABLE employee;
Partitioning data is often used for distributing load horizontally; it has a performance
benefit and helps in organizing data in a logical fashion.
Partitioning a table changes how Hive structures the data storage: Hive will now create
subdirectories reflecting the partitioning structure.
Eg : partition the employee table by year
ALTER TABLE employee
ADD PARTITION (year=2012)
LOCATION '/2012/part2012';
Bucketing
It is another technique for decomposing data sets into more manageable parts. The number of
buckets is fixed, so it does not fluctuate with the data.
If we bucket a table and use a column as the bucketing column, the value of that column will
be hashed into a user-defined number of buckets.
Assuming the number of records is much greater than the number of buckets, each bucket
will have many records. We use the CLUSTERED BY clause to divide the table into buckets.
Physically, each bucket is just a file in the table (or partition) directory. Similar to partitioning,
bucketed tables provide faster query response than non-bucketed tables.

Ex : creating a bucketed table
CREATE TABLE bucketed_user(
name VARCHAR(64),
city VARCHAR(64),
state VARCHAR(64),
phone VARCHAR(64) )
PARTITIONED BY (country VARCHAR(64))
CLUSTERED BY (state) SORTED BY (city) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;
(Note that country appears only in the PARTITIONED BY clause; in Hive a partition column
cannot also be declared as a regular table column.)
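A hedged sketch of how such a bucketed table might be populated is shown below; staged_users
is a hypothetical non-bucketed staging table with matching columns, and the setting shown was
required on older Hive versions so that Hive creates one file per bucket.

-- Populate one partition of the bucketed table from a hypothetical staging table.
SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE bucketed_user PARTITION (country = 'US')
SELECT name, city, state, phone
FROM staged_users
WHERE country = 'US';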

Q26. Draw & explain Hive architecture.

The main units of the Hive architecture and their operations are:
1. User Interface : Hive is a data warehouse infrastructure software that can create
interaction between the user and HDFS. The user interfaces that Hive supports are the Hive
Web UI, the Hive command line, and Hive HD Insight (on Windows Server).
2. Meta Store : Hive chooses respective database servers to store the schema or metadata of
tables, databases, columns in a table, their data types, and the HDFS mapping.
3. HiveQL Process Engine : HiveQL is similar to SQL and is used for querying the schema
information in the Metastore. It is one of the replacements for the traditional approach of
writing a MapReduce program: instead of writing a MapReduce program in Java, we can
write a query for the MapReduce job and have it processed.
4. Execution Engine : The conjunction of the HiveQL Process Engine and MapReduce is the
Hive Execution Engine. The execution engine processes the query and generates the same
results as MapReduce; it uses the flavor of MapReduce.
5. HDFS or HBASE : The Hadoop Distributed File System or HBase are the data storage
techniques used to store data in the file system.

Q27. Write down the difference between RDBMS and Hive.

1. Hive is data warehouse software; an RDBMS is a traditional, full database.
2. Hive enforces schema on read (very fast initial load); an RDBMS enforces schema on write
(loading the data is slow, but query-time performance is faster).
3. Hive is designed for write once, read many times; an RDBMS is designed for repeated
reading and writing.
4. Hive is based on Hadoop MapReduce batch processing of data and full table scans, and a
table update is achieved by transforming the data into a new table (there is no record-level
update, transaction or index feature in Hive); in an RDBMS, record-level updates, inserts,
deletes, transactions and indexes are possible.
5. The maximum data size Hive handles is in the hundreds of petabytes; for an RDBMS it is
in the tens of terabytes.
6. Hive does not support OLTP and is closer to OLAP (though not ideal for it); an RDBMS
supports OLTP.
7. Hive is suitable for data warehouse applications where relatively static data is analyzed,
fast response times are not required, and the data is not changing rapidly; an RDBMS is best
suited for dynamic data analysis where fast responses are expected.
8. Hive is easily scalable at low cost; an RDBMS is not as scalable, and scaling it up is very
costly.

Q28 . Write a note on Apache Hive.

Apache Hive is an open-source data warehouse system for querying and analyzing large
datasets stored in Hadoop files. Hadoop is a framework for handling large datasets in a
distributed computing environment.
It is built on top of Hadoop to summarize Big Data, and it makes querying and analysis
easy. It was developed by Facebook and later open-sourced. It stores data directly
on top of HDFS.
Hive has three main functions: data summarization, query and analysis. It supports
queries expressed in a language called HiveQL, which automatically translates SQL-like
queries into MapReduce jobs executed on Hadoop.
The traditional SQL queries must be implemented in the MapReduce Java API to execute
SQL applications and queries over distributed data. Hive provides the necessary SQL
abstraction to integrate SQL-like queries (HiveQL) into the underlying Java API without
the need to implement queries in the low-level Java API.
According to the Apache Hive wiki, "Hive is not designed for OLTP workloads and does
not offer real-time queries or row-level updates. It is best used for batch jobs over large
sets of append-only data (like web logs)."
Hive supports text files (flat files), SequenceFiles (flat files consisting of
binary key/value pairs) and RCFiles (Record Columnar Files, which store the columns of a
table in a columnar fashion).

Features of Hive:
1. Familiar: Query data with a SQL-based language for querying called HiveQL or HQL.
2. Fast: Interactive response times, even over huge datasets
3. Functions: Built-in user defined functions (UDFs) to manipulate dates, strings, and other
data-mining tools. Hive supports extending the UDF set to handle use-cases not supported by
built-in functions.
4. Scalable and Extensible: As data variety and volume grows, more commodity machines
can be added, without a corresponding reduction in performance
5. Compatible: Works with traditional data integration and data analytics tools.
6. Different storage types: Supports different storage types such as plain text, SequenceFiles,
RCFile and others (a hedged example follows this list).
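The storage format is chosen per table with the STORED AS clause; the sketch below is purely
illustrative, and the table and column names are invented rather than taken from these notes.

-- Plain text storage for raw data (hypothetical table).
CREATE TABLE raw_logs (line STRING)
STORED AS TEXTFILE;

-- Columnar RCFile storage for analytical queries (hypothetical table).
CREATE TABLE logs_columnar (ip STRING, url STRING, hits INT)
STORED AS RCFILE;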

Applications of Hive:
Best suited for Data Warehousing Applications
Data Mining
Ad-hoc Analysis
Business Intelligence
Data Visualization
Hive/Hadoop usage at Facebook:
Summarization:
E.g.: Daily/Weekly aggregations of impressions/click counts
Complex measures of user engagement
Ad-hoc Analysis:
E.g.: how many group admins broken down by state/country
Ad Optimization
Spam Detection
Application API usage patterns
Reports on pages
E.g.: graphical representation of the popularity of a page and of the increase in the number of
likes for a particular page
While developed by Facebook, Apache Hive is now used and developed by other companies
such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains
a fork of Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.

Q29. Explain the Hive query flow/execution.

Hive Architecture
The main components of Hive are:
UI The user interface for users to submit queries and other operations to the system. As
of 2011 the system had a command line interface and a web based GUI was being
developed.
Driver The component which receives the queries. This component implements the
notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC
interfaces.

Compiler The component that parses the query, does semantic analysis on the different
query blocks and query expressions and eventually generates an execution plan with the
help of the table and partition metadata looked up from the metastore.
Metastore The component that stores all the structure information of the various tables
and partitions in the warehouse including column and column type information, the
serializers and deserializers necessary to read and write data and the corresponding HDFS
files where the data is stored.
Execution Engine The component which executes the execution plan created by the
compiler. The plan is a DAG of stages. The execution engine manages the dependencies
between these different stages of the plan and executes these stages on the appropriate
system components.

A typical query flows through the system as follows.
Step 1:-
The UI calls the execute interface to the Driver.
Step 2:-
The Driver creates a session handle for the query and sends the query to the compiler to
generate an execution plan.
Step 3 and 4:-
The compiler gets the necessary metadata from the metastore.
Step 5:-
This metadata is used to typecheck the expressions in the query tree as well as to prune
partitions based on query predicates. The plan generated by the compiler (step 5) is a
DAG of stages, with each stage being either a map/reduce job, a metadata operation or an
operation on HDFS. For map/reduce stages, the plan contains map operator trees (operator
trees that are executed on the mappers) and a reduce operator tree (for operations that
need reducers).

Step 6:-
The execution engine submits these stages to appropriate components (steps 6, 6.1, 6.2
and 6.3). In each task (mapper/reducer) the deserializer associated with the table or
intermediate outputs is used to read the rows from HDFS files, and these are passed
through the associated operator tree. Once the output is generated, it is written to a
temporary HDFS file through the serializer (this happens in the mapper in case the
operation does not need a reduce). The temporary files are used to provide data to
subsequent map/reduce stages of the plan. For DML operations the final temporary file is
moved to the table's location. This scheme is used to ensure that dirty data is not read (a file
rename being an atomic operation in HDFS).

Step 7, 8 and 9:-


For queries, the contents of the temporary file are read by the execution engine directly
from HDFS as part of the fetch call from the Driver (steps 7, 8 and 9).
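The plan described in step 5 can be inspected with HiveQL's EXPLAIN statement. The sketch
below reuses the employee table from an earlier answer; the specific query is only an illustration.

-- Print the DAG of stages the compiler produces for this query (illustrative query).
EXPLAIN
SELECT destination, COUNT(*) AS total
FROM employee
GROUP BY destination;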

Q30. Write down the difference between partition and bucket concepts with advantages.

Hive Partition
Hive partitioning divides a large amount of data into multiple folders based on the values
of one or more table columns.
Hive partitioning is often used for distributing load horizontally; it has a performance
benefit and helps in organizing data in a logical fashion.

If you want to use partitioning in Hive, you should use the PARTITIONED BY (col1,
col2, ...) clause while creating the Hive table.
We can partition on any number of columns of a table by using the Hive partition
concept.
We can apply the Hive partitioning concept to Hive tables, whether managed tables or
external tables.
Partitioning works best when the cardinality of the partitioning field is not too
high.
For example, if we partition on a date column, a new partition directory is created for
every date, which puts a heavy burden on the NameNode metadata.

Example:
Assume that you are storing information about people in the entire world, spread across 196+
countries and spanning around 500 crores of entries. If you want to query people from a particular
country (say, Vatican City), then in the absence of partitioning you have to scan all 500 crores of
entries even to fetch the thousand entries of that country. If you partition the table based on country,
you can fine-tune the querying process by just checking the data for only one country partition.
Hive partitioning creates a separate directory for each value of the partition column(s).
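A hedged HiveQL sketch of this example is given below; the table and column names are invented
for the illustration and are not part of the original notes.

-- Hypothetical partitioned table: one subdirectory per country value.
CREATE TABLE people (
  name STRING,
  city STRING
)
PARTITIONED BY (country STRING);

-- Only the files under the country='Vatican City' partition directory are scanned.
SELECT name, city
FROM people
WHERE country = 'Vatican City';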
Advantages with Hive Partition
Distributes the execution load horizontally.
Faster execution of queries when a partition holds a low volume of data; e.g. getting the
population of Vatican City returns very fast instead of searching the entire population of
the world.
No need to scan the entire table for a single record.

Disadvantages with Hive Partition
There is a possibility of creating too many folders in HDFS, which is an extra burden on
NameNode metadata.
So there is no guarantee that partitioning will optimize queries in all cases.

Hive Bucketing
Hive bucketing divides the data into a fixed number of roughly equal parts (buckets).

If you want to use bucketing in Hive, you should use the CLUSTERED BY (col)
clause while creating a table in Hive.

We can apply the Hive bucketing concept to Hive managed tables or external tables.

Bucketing is typically performed on a single bucketing column.

The value of this column is hashed into a user-defined number of buckets.

Bucketing works well when the field has high cardinality and the data is evenly distributed
among the buckets.

If you want to run queries on date, timestamp or other columns that have very many distinct
values, the Hive bucketing concept is preferable.

We specify the number of buckets while creating the table.

Bucketing is also very useful for doing efficient map-side joins, etc.

Bucketing also very useful in doing efficient map-side joins etc.

Clustering, a.k.a. bucketing, on the other hand, results in a fixed number of files, since you
specify the number of buckets. What Hive does is take the field, calculate a hash and
assign each record to a bucket.

But what happens if you use, let's say, 256 buckets and the field you're bucketing on has a low
cardinality (for instance, it's a US state, so there can be only 50 different values)? You'll have 50
buckets with data, and 206 buckets with no data.

Advantages with Hive Bucketing

Due to roughly equal volumes of data in each bucket, map-side joins will be quicker.

Faster query response, similar to partitioning.

Disadvantages with Hive Bucketing

You can define the number of buckets during table creation, but loading equal volumes of
data into the buckets has to be managed by the programmer.

Q31. String & math operator in PigLatin

String Functions
Pig function names are case sensitive and UPPER CASE.

Pig string functions have an extra, first parameter: the string to which all the operations
are applied.

Pig may process results differently than as stated in the Java API Specification. If any
of the input parameters are null or if an insufficient number of parameters are supplied,
NULL is returned.

STRING OPERATOR:

Name Description

1. INDEXOF Returns the index of the first occurrence of a character in a


string, searching forward from a start index.

Syntax :

INDEXOF (string, 'character', startIndex)

2. LAST_INDEX_OF Returns the index of the last occurrence of a character in a


string, searching backward from a start index.

Syntax :

LAST_INDEX_OF(expression)

3. LCFIRST Converts the first character in a string to lower case.

Syntax :

LCFIRST(expression)

4. LOWER Converts all characters in a string to lower case.

Syntax :

LOWER(expression)

5. REGEX_EXTRACT Performs regular expression matching and extracts the matched


group defined by an index parameter.

Syntax :

REGEX_EXTRACT (string, regex, index)

Example :

This example will return the string '192.168.1.5'.

REGEX_EXTRACT('192.168.1.5:8020', '(.*)\:(.*)', 1);

6. REGEX_EXTRACT_ALL Performs regular expression matching and extracts all matched


groups.

Syntax :

REGEX_EXTRACT_ALL(string, regex)

Example :

This example will return the tuple (192.168.1.5,8020).

REGEX_EXTRACT_ALL('192.168.1.5:8020', '(.*)\:(.*)');

7. REPLACE Replaces existing characters in a string with new characters.

Syntax :

REPLACE(string, 'oldChar', 'newChar');

8. STRSPLIT Splits a string around matches of a given regular expression.

Syntax :

STRSPLIT(string, regex, limit)

9. SUBSTRING Returns a substring from a given string.

Syntax :

SUBSTRING(string, startIndex, stopIndex)

10. TRIM Returns a copy of a string with leading and trailing white space
removed.

Syntax :

TRIM(expression)

11. UPPER Returns a string converted to upper case.

Syntax:

UPPER(expression)
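As a hedged illustration of a few of the string functions above, the sketch below applies them in
a FOREACH; the input file, delimiter and field names are assumptions made for this example.

-- Hypothetical relation of user names and e-mail addresses.
users   = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, email:chararray);
cleaned = FOREACH users GENERATE
              UPPER(TRIM(name))        AS name,
              LOWER(email)             AS email,
              INDEXOF(email, '@', 0)   AS at_position,
              SUBSTRING(email, 0, 3)   AS prefix;
DUMP cleaned;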

MATH Functions:
Pig function names are case sensitive and UPPER CASE.

Pig may process results differently than as stated in the Java API Specification:

o If the result value is null or empty, Pig returns null.

o If the result value is not a number (NaN), Pig returns null.

o If Pig is unable to process the expression, Pig returns an exception.

NAME DESCRIPTION

1. ABS Returns the absolute value of an expression.

Syntax

ABS(expression)

Usage

Use the ABS function to return the absolute value of an expression.

If the value is not negative (x >= 0), the value is returned. If the value is negative
(x < 0), the negation of the value is returned.

2. ACOS Returns the arc cosine of an expression.

Syntax :

ACOS(expression)

Usage

Use the ACOS function to return the arc cosine of an expression.

3. ASIN Returns the arc sine of an expression.

Syntax

ASIN(expression)

Usage

Use the ASIN function to return the arc sine of an expression.

4. ATAN Returns the arc tangent of an expression.

Syntax

ATAN(expression)

Usage

Use the ATAN function to return the arc tangent of an expression.

5. CBRT Returns the cube root of an expression.

Syntax

CBRT(expression)

Usage

Use the CBRT function to return the cube root of an expression.

6. CEIL Returns the value of an expression rounded up to the nearest integer.

Syntax

CEIL(expression)

Usage

Use the CEIL function to return the value of an expression rounded up to the
nearest integer. This function never decreases the result value.

Example :

X CEIL(X)

4.6 5

3.5 4

2.4 3

7. COSH Returns the hyperbolic cosine of an expression.

Syntax

COSH(expression)

Usage

Use the COSH function to return the hyperbolic cosine of an expression.

8. COS Returns the trigonometric cosine of an expression.

Syntax

COS(expression)

Usage

Use the COS function to return the trigonometric cosine of an expression.

9. EXP Returns Euler's number e raised to the power of x.

Syntax

EXP(expression)

Usage

Use the EXP function to return the value of Euler's number e raised to the power
of x (where x is the result value of the expression).

10. FLOOR Returns the value of an expression rounded down to the nearest integer.

Syntax

FLOOR(expression)

Usage

Use the FLOOR function to return the value of an expression rounded down to
the nearest integer. This function never increases the result value.

Example

X FLOOR(X)

4.6 4

3.5 3

2.4 2

11. LOG Returns the natural logarithm (base e) of an expression.

Syntax

LOG(expression)

Usage

Use the LOG function to return the natural logarithm (base e) of an expression.

12. RANDOM Returns a pseudo random number.

Syntax

RANDOM( )

Usage

Use the RANDOM function to return a pseudo random number (type double)
greater than or equal to 0.0 and less than 1.0.

13. ROUND Returns the value of an expression rounded to an integer.

Syntax

ROUND(expression)

Usage

Use the ROUND function to return the value of an expression rounded to an


integer (if the result type is float) or rounded to a long (if the result type is
double).

Example

X ROUND(X)

4.6 5

3.5 4

2.4 2
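As a hedged illustration of some of the math functions above, the sketch below applies them to a
numeric field; the input file, delimiter and field names are invented for this example.

-- Hypothetical relation of sensor readings.
readings = LOAD 'readings.txt' USING PigStorage(',') AS (sensor:chararray, value:double);
derived  = FOREACH readings GENERATE
               sensor,
               ABS(value)    AS magnitude,
               CEIL(value)   AS rounded_up,
               FLOOR(value)  AS rounded_down,
               ROUND(value)  AS nearest;
DUMP derived;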

