UNIT-I
• Marketing
• Finance
• Government
• Healthcare
• Insurance
• Retail
4. What is reporting?
It is the process of organizing data into informational summaries in order to monitor how
different areas of a business are performing.
5. What is analysis?
It is the process of exploring data and reports in order to extract meaningful insights which
can be used to better understand and improve business performance.
MapReduce provides a data-parallel programming model for clusters of commodity machines. It
was pioneered by Google, which uses it to process about 20 PB of data per day. MapReduce was
popularized by the Apache Hadoop project and is used by Yahoo, Facebook, Amazon, and others.
In cloud computing, cheap nodes fail, especially when you have many of them: the mean time
between failures (MTBF) is about 3 years for 1 node but about 1 day for 1000 nodes, and a
commodity network has low bandwidth.
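To make the model concrete, here is a minimal word-count sketch in base R that mimics the map, shuffle, and reduce phases (the documents are made up; this illustrates the idea only, not the Hadoop API):

# Map: each document emits (word, 1) pairs, here as a named numeric vector.
map_fn <- function(doc) {
  words <- strsplit(doc, " ")[[1]]
  setNames(rep(1, length(words)), words)
}

# Reduce: for a single key, sum all the counts emitted for it.
reduce_fn <- function(values) sum(values)

docs    <- c("the cat sat", "the dog sat")
emitted <- unlist(lapply(docs, map_fn))            # map phase over all documents
grouped <- split(unname(emitted), names(emitted))  # shuffle: group values by key
result  <- sapply(grouped, reduce_fn)              # reduce phase, once per key
print(result)  # cat = 1, dog = 1, sat = 2, the = 2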
12. What are the factors to be considered while selecting a sample in statistics?
The sample should be truly representative of the population, free from bias, and adequate in size.
15. What are the basic differences between relational database and HDFS?
(NOV/DEC 2014)
Here are the key differences between HDFS and a relational database:
Data Types: RDBMS relies on structured data, and the schema of the data is always known. In Hadoop, any kind of data can be stored, be it structured, semi-structured, or unstructured.
Read/Write Speed: In RDBMS, reads are fast because the schema of the data is already known. In HDFS, writes are fast because no schema validation happens during an HDFS write.
Best Fit Use Case: RDBMS is used for OLTP (Online Transaction Processing) systems, whereas Hadoop is used for data discovery, data analytics, or OLAP systems.
HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for
storing different kinds of data as blocks in a distributed environment, and it follows a
master/slave topology.
NameNode: The NameNode is the master node in the distributed environment. It maintains
the metadata for the blocks of data stored in HDFS, such as the block locations and
replication factors.
DataNode: DataNodes are the slave nodes, which are responsible for storing the data in
HDFS. The NameNode manages all the DataNodes.
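As a quick back-of-the-envelope illustration of block metadata in R (the file size is a made-up figure; 128 MB and 3 are the usual Hadoop 2.x default block size and replication factor):

file_size_mb  <- 500   # hypothetical file to be stored
block_size_mb <- 128   # default HDFS block size in Hadoop 2.x
replication   <- 3     # default replication factor

n_blocks   <- ceiling(file_size_mb / block_size_mb)  # blocks the NameNode tracks
n_replicas <- n_blocks * replication                 # copies spread over DataNodes
cat(n_blocks, "blocks,", n_replicas, "block replicas\n")  # 4 blocks, 12 replicas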
YARN (Yet Another Resource Negotiator) is the processing framework of Hadoop; it manages
resources and provides an execution environment for processes.
18. Write about the various Hadoop daemons and their roles in a Hadoop cluster.
Generally, approach this question by first explaining the HDFS daemons, i.e. the NameNode,
DataNode, and Secondary NameNode, then moving on to the YARN daemons, i.e. the
ResourceManager and NodeManager, and lastly explaining the JobHistoryServer.
NameNode: It is the master node, responsible for storing the metadata of all the files
and directories. It has information about the blocks that make up a file and about where
those blocks are located in the cluster.
DataNode: It is the slave node that contains the actual data.
Secondary NameNode: It periodically merges the changes (edit log) with the FsImage
(Filesystem Image) present in the NameNode, and stores the modified FsImage in
persistent storage, which can be used in case of a NameNode failure.
ResourceManager: It is the central YARN authority that manages resources and schedules
the applications running on top of YARN.
NodeManager: It runs on each slave machine, launches the application containers, monitors
their resource usage, and reports this to the ResourceManager.
JobHistoryServer: It maintains information about completed MapReduce jobs after their
ApplicationMaster terminates.
The active NameNode is the NameNode that works and runs in the cluster.
The passive NameNode is a standby NameNode, which holds the same data as the active
NameNode.
When the active NameNode fails, the passive NameNode takes its place in the cluster. Hence,
the cluster is never without a NameNode, and so it never fails.
1. Standalone (local) mode: This is the default mode if we don't configure anything. In this
mode, all the components of Hadoop, such as the NameNode, DataNode, ResourceManager, and
NodeManager, run as a single Java process and use the local filesystem.
2. Pseudo-distributed mode: A single-node Hadoop deployment is considered to be running in
pseudo-distributed mode. In this mode, all the Hadoop services, including both the master
and the slave services, execute on a single compute node.
3. Fully distributed mode: A Hadoop deployment in which the master and slave services run
on separate nodes is said to be in fully distributed mode.
UNIT-II
Rule induction is an area of machine learning in which formal rules are extracted from a
set of observations. The rules extracted may represent a full scientific model of the data, or
merely represent local patterns in the data.
o General to specific
o Specific to general
• Single layer
• Multi layer
• Recurrent
• Self-organized
Beyond data mining tasks such as prediction, classification, etc., fuzzy models can give insight
into the underlying system and can be derived automatically from the system's dataset. The
technique used to achieve this is a grid-based rule set.
6
Data Analytics
Fuzzy modeling can be interpreted as a qualitative modeling scheme by which the system
behavior is qualitatively described using a natural language. A fuzzy qualitative model is a
generalized fuzzy model consisting of linguistic explanations of system behavior in the
framework of fuzzy logic, instead of mathematical equations with numerical values or
conventional logical formulas with logical symbols.
A time series is a sequential set of data points, typically measured at successive times. It is
mathematically defined as a set of vectors x(t), t = 0, 1, 2, …, where t represents the elapsed
time. The variable x(t) is treated as a random variable.
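For instance, a minimal R sketch that treats a sequence of observations as a time series (the monthly values are made up):

x <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118)  # x(t), t = 1..12
series <- ts(x, start = c(2023, 1), frequency = 12)  # monthly series from Jan 2023
print(series)
plot(series)   # successive values of x(t) plotted against time t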
• Minimal requirements for domain knowledge to determine input parameters
• Constraint-based clustering
Hierarchical methods
Y = a + bX
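A quick R illustration of fitting this model by least squares (the data are made up; the true intercept a = 2 and slope b = 3 are assumptions of the toy example):

set.seed(1)
X <- 1:20
Y <- 2 + 3 * X + rnorm(20, sd = 2)  # noisy observations around Y = 2 + 3X

fit <- lm(Y ~ X)  # least-squares estimates of a (intercept) and b (slope)
coef(fit)         # should be close to a = 2 and b = 3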
15. State the types of linear model and state its use.
Generalized linear models represent the theoretical foundation on which linear regression can be
applied to the modeling of categorical response variables. The types of generalized linear
models are (each is sketched in R below):
Logistic regression
Poisson regression
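A hedged sketch of each type, using R's glm() on tiny made-up data sets:

x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(0, 0, 0, 1, 0, 1, 1, 1)                 # binary (categorical) response
logit_fit <- glm(y ~ x, family = binomial)     # logistic regression

counts <- c(2, 3, 6, 7, 8, 9, 10, 12)          # count response
pois_fit <- glm(counts ~ x, family = poisson)  # Poisson regression
summary(logit_fit)$coefficients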
16. Write the preprocessing steps that may be applied to the data for classification and
prediction.
a. Data Cleaning
b. Relevance Analysis
c. Data Transformation
22. Define Confidence.
Confidence is the ratio of the number of transactions that include all items in the consequent as
well as the antecedent to the number of transactions that include all items in the antecedent.
Confidence is an association-rule interestingness measure.
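A small worked example in R on five made-up transactions, computing the confidence of the rule {milk} → {bread}:

baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "eggs"),
  c("milk"),
  c("bread"),
  c("milk", "bread")
)

has_antecedent <- sapply(baskets, function(b) "milk" %in% b)
has_both       <- sapply(baskets, function(b) all(c("milk", "bread") %in% b))
confidence     <- sum(has_both) / sum(has_antecedent)  # 3 / 4 = 0.75
print(confidence)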
UNIT-III
A data stream is a real-time, continuous, and ordered sequence of items. It is not possible to
control the order in which the items arrive, nor is it feasible to locally store a stream in its
entirety on any memory device.
Data Stream Mining is the process of extracting useful knowledge from continuous, rapid
data streams. Many traditional data mining algorithms can be recast to work with larger datasets,
but they cannot address the problem of a continuous supply of data.
Sensor networks are a huge source of data occurring in streams. They are used in numerous
situations that require constant monitoring of several variables, based on which important
decisions are made. In many cases, alerts and alarms may be generated as a response to the
information received from a series of sensors.
One-Time queries are queries that are evaluated once over a point-in-time snapshot of the data
set, with the answer returned to the user.
E.g.: A stock price checker may alert the user when a stock price crosses a particular price point.
Biased reservoir sampling uses a bias function to regulate sampling from the stream.
The bias gives a higher probability of selecting data points from recent parts of the stream as
compared to the distant past.
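A minimal R sketch of an exponentially biased variant: once the reservoir is full, every new point overwrites a uniformly random slot, so a point that arrived k steps ago survives with probability (1 - 1/n)^k, which decays with age (the stream values are made up):

set.seed(42)
n <- 10                 # reservoir capacity
stream <- 1:1000        # stand-in stream; arrival order equals value
reservoir <- numeric(0)

for (x in stream) {
  if (length(reservoir) < n) {
    reservoir <- c(reservoir, x)   # fill phase
  } else {
    reservoir[sample(n, 1)] <- x   # overwrite a uniformly random slot
  }
}
print(sort(reservoir))  # dominated by recent (large) values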
10. What is RTSA?
Real-time sentiment analysis (also known as opinion mining) refers to the use of natural
language processing, text analysis, and computational linguistics to identify and extract
subjective information from source materials.
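As a toy illustration of the text-analysis side, a lexicon-based sentiment scorer in R (the word lists are made up and far too small for real use):

positive <- c("good", "great", "love")
negative <- c("bad", "terrible", "hate")

score <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(words %in% positive) - sum(words %in% negative)  # positive minus negative hits
}

score("I love this phone, great battery")  #  2
score("Terrible screen and bad sound")     # -2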
It is a process that abstracts a large set of task-relevant data in a database from relatively low
conceptual levels to higher conceptual levels. There are 2 approaches for generalization: the
data cube (OLAP) approach and the attribute-oriented induction approach.
The latter collects the task-relevant data using a relational database query and then performs
generalization based on an examination of the data in the relevant set.
Scoring function (e.g., least nodes, most robust, least assumptions)
UNIT-IV
The main purpose of association rule mining is to discover, from the frequent itemsets of a
large dataset, a set of if-then rules called association rules. An association rule has the
form I → j, where I is a set of items (products) and j is a particular item.
FOR(each basket):
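For instance, a minimal R sketch of such a per-basket counting pass, keeping the items whose counts reach the support threshold (the basket contents and the threshold are made up):

baskets <- list(
  c("milk", "bread"),
  c("beer", "bread", "diapers"),
  c("milk", "beer", "bread")
)

counts <- table(unlist(baskets))   # one pass: add 1 per item per basket
support_threshold <- 2
frequent_items <- names(counts[counts >= support_threshold])
print(frequent_items)  # "beer" "bread" "milk"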
Toivonen's algorithm makes only one full pass over the database, and thus produces exact
association rules in a single full pass. The algorithm gives neither false negatives nor false
positives, but there is a small yet non-zero probability that it fails to produce any answer at
all, in which case it must be repeated until it succeeds. Toivonen's algorithm begins by
selecting a small sample of the input dataset and finding the candidate frequent itemsets from it.
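A hedged R sketch of just that sampling step: count items in a small sample of the baskets and keep those above a threshold lowered in proportion to the sample size (all figures are made up):

set.seed(3)
all_items <- c("a", "b", "c", "d", "e")
baskets <- replicate(1000, sample(all_items, size = 2), simplify = FALSE)

sample_frac <- 0.1
idx <- sample(length(baskets), sample_frac * length(baskets))  # sample of baskets
s <- 150                          # support threshold for the full dataset
lowered <- 0.8 * sample_frac * s  # lowered threshold guards against false negatives

counts <- table(unlist(baskets[idx]))
candidates <- names(counts[counts >= lowered])
print(candidates)  # candidate frequent items, to be verified in the full pass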
o Collaborative filtering
o Customer segmentation
o Data summarization
o Dynamic trend detection
o Multimedia data analysis
o Biological data analysis
o Social network analysis
o Single-link clustering
o Complete-link clustering
o Average-link clustering
o Centroid-link clustering
8. Define CLIQUE.
CLIQUE is a subspace clustering algorithm that automatically finds subspaces with high-density
clusters in high-dimensional attribute spaces. CLIQUE is a simple grid-based method for
finding density-based clusters in subspaces. The procedure for this grid-based clustering is
relatively simple.
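A toy 2-D illustration in R of the grid-density idea that CLIQUE builds on (real CLIQUE does this per subspace of a high-dimensional space; the points and the density threshold here are made up):

set.seed(11)
x <- c(rnorm(50, mean = 2), rnorm(50, mean = 8))
y <- c(rnorm(50, mean = 2), rnorm(50, mean = 8))

gx <- cut(x, breaks = seq(-2, 12, by = 2))  # partition each dimension
gy <- cut(y, breaks = seq(-2, 12, by = 2))  # into equal-width intervals
dens <- table(gx, gy)                       # points falling in each grid cell
print(dens >= 10)                           # TRUE marks the dense cells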
The family of algorithms is of the point-assignment type and assumes a Euclidean space. It is
assumed that there are exactly k clusters for some known k. After picking k initial cluster
centroids, the points are considered one at a time and assigned to the closest centroid.
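This describes the k-means family; a quick illustration with R's built-in kmeans() on made-up 2-D points, assuming k = 2 is known:

set.seed(7)
pts <- rbind(
  matrix(rnorm(20, mean = 0), ncol = 2),  # points around (0, 0)
  matrix(rnorm(20, mean = 5), ncol = 2)   # points around (5, 5)
)

fit <- kmeans(pts, centers = 2)  # pick 2 initial centroids, then assign points
fit$centers                      # the two centroids found
fit$cluster                      # closest-centroid assignment for each point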
UNIT-V
2. What is Hive?
Hive provides a warehouse structure for other Hadoop input sources and SQL-like access to
data in HDFS. Hive's query language, HiveQL, compiles to MapReduce and also allows
user-defined functions (UDFs).
3. What are the responsibilities of the MapReduce framework?
The framework takes care of scheduling tasks, monitoring them, and re-executing any failed tasks.
The key-value store uses a key to access a value and has a schema-less format. The key can be
artificially generated or auto-generated, while the value can be a string, JSON, a BLOB, etc.
The key-value store uses a hash table with a unique key and a pointer to a particular item of
data.
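As a rough illustration, an R environment behaves like a hash table, which gives a minimal key-value store sketch (the keys and values are made up):

store <- new.env(hash = TRUE)  # hash-backed container

assign("user:42", list(name = "Ada", plan = "pro"), envir = store)  # put
assign("session:9", "2024-01-01T10:00:00Z", envir = store)          # put

get("user:42", envir = store)     # get the value stored under a key
exists("order:7", envir = store)  # FALSE: no such key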
5. What is visualization? What are the three major goals of visualization? (APR/MAY 2019)
Visualization is the graphical presentation of information, with the goal of providing the viewer
with a qualitative understanding of the information contents. The three major goals of
visualization are explorative analysis, confirmative analysis, and presentation.
6. What is sharding?
Sharding is the horizontal partitioning of a large database, which partitions the rows of the
database. Each partition forms a shard, "shard" meaning a small part of the whole. Each shard
can be located on a separate database server or in a separate physical location.
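A minimal sketch of hash-based shard assignment in R (the shard count and row keys are made up):

n_shards <- 4
keys <- c("alice", "bob", "carol", "dave", "erin")

# Toy (non-cryptographic) hash: sum of character codes modulo the shard count
shard_of <- function(key) sum(utf8ToInt(key)) %% n_shards + 1
data.frame(key = keys, shard = sapply(keys, shard_of))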
7. Write about data import in the R language.
R Commander is used to import data in the R language. To start the R Commander GUI, the user
loads the package by typing library(Rcmdr) into the R console. There are 3 different ways in
which data can be imported in the R language:
• Users can select the data set in the dialog box or enter the name of the data set (if they
know it).
• Data can also be entered directly using the editor of R Commander via Data->New Data
Set. However, this works well only when the data set is not too large.
• Data can also be imported from a URL, from a plain-text file (ASCII), from any other
statistical package, or from the clipboard.
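Outside of R Commander, the same imports can be scripted directly. A small sketch (the file name and URL are placeholders):

sales  <- read.csv("sales.csv", header = TRUE)      # local CSV file
remote <- read.table("https://example.com/data.txt",
                     header = TRUE)                 # plain-text file at a URL
pasted <- read.table("clipboard", header = TRUE)    # from the clipboard (Windows)
str(sales)  # inspect the structure of the imported data frame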
8) Two vectors X and Y are defined as follows: X <- c(3, 2, 4) and Y <- c(1, 2). What will be
the output of the vector Z that is defined as Z <- X*Y?
In the R language, when two vectors have different lengths, the shorter vector is recycled: its
elements are reused from the beginning until every element of the longer vector has been
multiplied. Here Y is recycled to (1, 2, 1), so Z is c(3*1, 2*2, 4*1):
Z <- c(3, 4, 4)
Because the longer length (3) is not a multiple of the shorter length (2), R also issues a
warning about the lengths.
10) The R language has several packages for solving a particular problem. How do you make a
decision on which one is best to use?
The CRAN package ecosystem has more than 6,000 packages. The best way for beginners to answer
this question is to mention that they would look for a package that follows good software
development principles. The next thing would be to look for user reviews and to find out whether
other data scientists or analysts have been able to solve a similar problem.
It is the central repository of Apache Hive metadata. It stores metadata for Hive tables (such
as their schema and location) and partitions in a relational database. It also provides client
access to this information with the help of the metastore service API.
By default, Hive offers an embedded Derby database instance, backed by the local disk, for the
metastore. This is what we call the embedded metastore configuration.
Arithmetic Operators
Relational Operators
Logical Operators
String Operators
For decomposing table data sets into more manageable parts, Apache Hive offers another
technique, called bucketing.
In Hive, tables or partitions are subdivided into buckets, based on the hash of a column in
the table, to give the data extra structure that may be used for more efficient queries.
Internal table (managed table): A managed table is also known as an internal table. This is the
default table type in Hive: when a user creates a table in Hive without specifying it as
external, a managed table is created by default. If we create a table as a managed table, the
table is created in a specific location in HDFS.
External table: An external table is mostly created when the data is also used outside of Hive.
Whenever we want to delete the table's metadata while keeping the table's data as it is, we use
an external table. Dropping an external table deletes only the schema of the table.
User Interface
Compiler
Metastore
Driver
Execution Engine
A SerDe is a short name for Serializer/Deserializer. Hive uses a SerDe to read and write data
from tables. An important concept behind Hive is that it does NOT own the Hadoop file system
format in which the data is stored. Users are able to write files to HDFS with whatever tools
or mechanisms they like ("CREATE EXTERNAL TABLE" or "LOAD DATA INPATH") and use Hive to
correctly "parse" that file format in a way that can be used by Hive. A SerDe is thus a
powerful (and customizable) mechanism for parsing data stored in HDFS.
3. Drop database
HCatalog is a table and storage management layer for Hadoop that enables users with different
data processing tools, such as Pig and MapReduce, to more easily read and write data on the
grid. HCatalog can be used to share data structures with external systems. HCatalog provides
access to the Hive metastore to users of other tools on Hadoop, so that they can read and
write data to the Hive data warehouse.
20. Why does Hive not store metadata information in HDFS?
Hive stores metadata information in the metastore using an RDBMS instead of HDFS. The main
reason for choosing an RDBMS is to achieve low latency, because HDFS read/write operations are
time-consuming.
Numeric types: TINYINT (1-byte signed integer, from -128 to 127), SMALLINT (2-byte signed
integer, from -32,768 to 32,767), ...
Date/time types: TIMESTAMP, DATE
String types: STRING, VARCHAR, ...
Misc types: BOOLEAN, ...
Complex types: arrays: ARRAY<data_type>