DWH – 2 Marks

Unit I

1. Define Data mining.
It refers to extracting or "mining" knowledge from large amounts of data. Data mining is a process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories.

2. Give some alternative terms for data mining.
Knowledge mining
Knowledge extraction
Data/pattern analysis
Data archaeology
Data dredging

3. What is KDD?
KDD – Knowledge Discovery in Databases.

4. What are the steps involved in the KDD process?
Data cleaning
Data integration
Data selection
Data transformation
Data mining
Pattern evaluation
Knowledge presentation

5. What is the use of the knowledge base?
A knowledge base is domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies used to organize attributes/attribute values into different levels of abstraction.

6. Mention some of the data mining techniques.
Statistics
Machine learning
Decision trees
Hidden Markov models
Artificial intelligence
Genetic algorithms
Meta learning

7. Give a few statistical techniques.
Point estimation
Data summarization
Bayesian techniques
Hypothesis testing
Correlation
Regression

8. What is the purpose of a data mining technique?
It provides a way to use the various data mining tasks.

9. Define Predictive model.
It is used to predict the values of data by making use of known results from a different set of sample data.

10. Data mining tasks that belong to the predictive model:
Classification
Regression
Time series analysis

11. Define descriptive model.
It is used to determine the patterns and relationships in a sample of data. Data mining tasks that belong to the descriptive model:
Clustering
Summarization
Association rules
Sequence discovery

12. Define the term summarization.
The summarization of a large chunk of data contained in a web page or a document.
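The KDD steps above can be illustrated with a toy pipeline. This is only a sketch; the function names and the frequency-counting "mining" step are invented for illustration, not part of any standard library:

```python
# Hypothetical sketch of the KDD process: clean -> select/transform ->
# mine -> evaluate. Real systems use far richer logic at every stage.

def clean(records):
    # Data cleaning: drop records with missing values.
    return [r for r in records if None not in r.values()]

def select_and_transform(records, fields):
    # Data selection and transformation: keep relevant fields, normalize text.
    return [{f: str(r[f]).lower() for f in fields} for r in records]

def mine_patterns(records, field):
    # "Mining": count value frequencies as a stand-in for pattern discovery.
    counts = {}
    for r in records:
        counts[r[field]] = counts.get(r[field], 0) + 1
    return counts

def evaluate(patterns, min_count):
    # Pattern evaluation: keep only patterns above an interestingness threshold.
    return {k: v for k, v in patterns.items() if v >= min_count}

raw = [{"item": "Milk"}, {"item": "milk"}, {"item": None}, {"item": "Bread"}]
cleaned = clean(raw)
prepared = select_and_transform(cleaned, ["item"])
patterns = mine_patterns(prepared, "item")
interesting = evaluate(patterns, min_count=2)
print(interesting)  # {'milk': 2}
```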
Summarization = characterization = generalization.

13. List out the advanced database systems.
Extended-relational databases
Object-oriented databases
Deductive databases
Spatial databases
Temporal databases
Multimedia databases
Active databases
Scientific databases
Knowledge databases

14. Define cluster analysis.
Cluster analysis analyzes data objects without consulting a known class label. The class labels are not present in the training data simply because they are not known to begin with.

15. Classifications of data mining systems.
Based on the kinds of databases mined:
o According to model:
  Relational mining system
  Transactional mining system
  Object-oriented mining system
  Object-relational mining system
  Data warehouse mining system
o According to types of data:
  Spatial data mining system
  Time-series data mining system
  Text data mining system
  Multimedia data mining system
Based on the kinds of knowledge mined:
o According to functionalities:
  Characterization
  Discrimination
  Association
  Classification
  Clustering
  Outlier analysis
  Evolution analysis
o According to levels of abstraction of the knowledge mined:
  Generalized knowledge (high level of abstraction)
  Primitive-level knowledge (raw data level)
o According to mining data regularities versus mining data irregularities
Based on the kinds of techniques utilized:
o According to user interaction:
  Autonomous systems
  Interactive exploratory systems
  Query-driven systems
o According to methods of data analysis:
  Database-oriented
  Data warehouse-oriented
  Machine learning
  Statistics
  Visualization
  Pattern recognition
  Neural networks
Based on the applications adopted:
o Finance
o Telecommunication
o DNA
o Stock markets
o E-mail, and so on

16. Describe challenges to data mining regarding data mining methodology and user interaction issues.
Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
Data mining query languages and ad hoc data mining
Presentation and visualization of data mining results
Handling noisy or incomplete data
Pattern evaluation
17. Describe challenges to data mining regarding performance issues.
Efficiency and scalability of data mining algorithms
Parallel, distributed, and incremental mining algorithms

18. Describe issues relating to the diversity of database types.
Handling of relational and complex types of data
Mining information from heterogeneous databases and global information systems

19. What is meant by pattern?
A pattern represents knowledge if it is easily understood by humans; valid on test data with some degree of certainty; and potentially useful, novel, or validates a hunch about which the user was curious. Measures of pattern interestingness, either objective or subjective, can be used to guide the discovery process.

20. How is a data warehouse different from a database?
A data warehouse is a repository of multiple heterogeneous data sources, organized under a unified schema at a single site in order to facilitate management decision-making. A database consists of a collection of interrelated data.

21. What are the uses of statistics in data mining?
Statistics is used to
estimate the complexity of a data mining problem;
suggest which data mining techniques are most likely to be successful; and
identify data fields that contain the most "surface information".

22. What is the main goal of statistics?
The basic goal of statistics is to extend knowledge about a subset of a collection to the entire collection.

23. What are the factors to be considered while selecting the sample in statistics?
The sample should be
large enough to be representative of the population;
small enough to be manageable;
accessible to the sampler; and
free of bias.

24. Name some advanced database systems.
Object-oriented databases, object-relational databases.

25. Name some specific application-oriented databases.
Spatial databases, time-series databases, text databases and multimedia databases.

26. Define Relational databases.
A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.

27. Define Transactional databases.
A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.

28. Define Spatial databases.
Spatial databases contain spatial-related information. Such databases include geographic (map) databases, VLSI chip design databases, and medical and satellite image databases. Spatial data may be represented in raster format, consisting of n-dimensional bit maps or pixel maps.

29. What is a Temporal database?
A temporal database stores time-related data. It usually stores relational data that include time-related attributes. These attributes may involve several time stamps, each having different semantics.

30. What is a Time-series database?
A time-series database stores sequences of values that change with time, such as data collected regarding the stock exchange.

31. What is a Legacy database?
A legacy database is a group of heterogeneous databases that combines different kinds of data systems, such as relational or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases or file systems.

32. What are the steps in the data mining process?
a. Data cleaning
b. Data integration
c. Data selection
d. Data transformation
e. Data mining
f. Pattern evaluation
g. Knowledge representation

33. Define data cleaning.
Data cleaning means removing the inconsistent data or noise and collecting the necessary information.

34. Define data mining.
Data mining is a process of extracting or mining knowledge from huge amounts of data.

35. Define pattern evaluation.
Pattern evaluation is used to identify the truly interesting patterns representing knowledge, based on some interestingness measures.

36. Define knowledge representation.
Knowledge representation techniques are used to present the mined knowledge to the user.

37. What is Visualization?
Visualisation is for the depiction of data and to gain intuition about the data being observed. It assists analysts in selecting display formats, viewer perspectives and data representation.

38. Name some conventional visualization techniques.
Histogram
Relationship tree
Bar charts
Pie charts
Tables, etc.

39. Give the features included in modern visualisation techniques.
a. Morphing
b. Animation
c. Multiple simultaneous data views
d. Drill-down
e. Hyperlinks to related data sources

40. Define conventional visualisation.
Conventional visualisation depicts information about a population and not the population data itself.
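The data cleaning tasks just listed (filling in missing values, identifying outliers) can be sketched in a few lines. The mean-fill strategy and the deviation threshold below are illustrative choices, not the only possible ones:

```python
# A rough sketch of data cleaning: fill missing values with the
# attribute mean, and flag outliers as values far from the mean.
# The factor of 1.5 standard deviations is an arbitrary illustration.

def fill_missing_with_mean(values):
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def find_outliers(values, factor):
    mean = sum(values) / len(values)
    # population standard deviation
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [v for v in values if abs(v - mean) > factor * sd]

ages = [23, 25, None, 24, 90]
filled = fill_missing_with_mean(ages)      # None -> mean of the known values
print(find_outliers(filled, factor=1.5))   # [90]
```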
41. Define Spatial visualisation.
Spatial visualisation depicts actual members of the population in their feature space.

42. What is descriptive and predictive data mining?
Descriptive data mining describes the data set in a concise and summarative manner and presents interesting general properties of the data.
Predictive data mining analyzes the data in order to construct one or a set of models and attempts to predict the behavior of new data sets.

43. Merits of a Data Warehouse.
Ability to make effective decisions from the database
Better analysis of data and decision support
Discover trends and correlations that benefit business
Handle huge amounts of data

44. What are the characteristics of a data warehouse?
Separate
Available
Integrated
Subject oriented
Not dynamic
Consistency
Iterative development
Aggregation performance

45. List some of the data warehouse tools.
OLAP (OnLine Analytical Processing)
ROLAP (Relational OLAP)
End user data access tools
Ad hoc query tools
Data transformation services
Replication

46. Explain OLAP.
The general activity of querying and presenting text and number data from data warehouses, as well as a specifically dimensional style of querying and presenting, that is exemplified by a number of "OLAP vendors". The OLAP vendors' technology is nonrelational and is almost always based on an explicit multidimensional cube of data. OLAP databases are also known as multidimensional cube databases.

47. Explain ROLAP.
ROLAP is a set of user interfaces and applications that give a relational database a dimensional flavour. ROLAP stands for Relational OnLine Analytical Processing.

UNIT-II

1. Define data warehouse.
A data warehouse is a repository of multiple heterogeneous data sources organized under a unified schema at a single site to facilitate management decision making.
(or)
A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision-making process.

2. What are operational databases?
Organizations maintain large databases that are updated by daily transactions; these are called operational databases.

3. Define OLTP.
If an on-line operational database system is used for efficient retrieval, efficient storage and management of large amounts of data, then the system is said to be an on-line transaction processing (OLTP) system.
4. Define OLAP.
Data warehouse systems serve users (or) knowledge workers in the role of data analysis and decision-making. Such systems can organize and present data in various formats. These systems are known as on-line analytical processing (OLAP) systems.

5. How is a database design represented in OLTP systems?
Entity-relationship model

6. How is a database design represented in OLAP systems?
Star schema
Snowflake schema
Fact constellation schema

7. List out the steps of the data warehouse design process.
Choose a business process to model.
Choose the grain of the business process.
Choose the dimensions that will apply to each fact table record.
Choose the measures that will populate each fact table record.

8. What is an enterprise warehouse?
An enterprise warehouse collects all the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one (or) more operational systems (or) external information providers. It contains detailed data as well as summarized data and can range in size from a few gigabytes to hundreds of gigabytes, terabytes (or) beyond. An enterprise data warehouse may be implemented on traditional mainframes, UNIX superservers (or) parallel architecture platforms. It requires business modeling and may take years to design and build.

9. What is a data mart?
A data mart is a database that contains a subset of the data present in a data warehouse. Data marts are created to structure the data in a data warehouse according to issues such as hardware platforms and access control strategies. We can divide a data warehouse into data marts after the data warehouse has been created. Data marts are usually implemented on low-cost departmental servers that are UNIX (or) Windows/NT based. The implementation cycle of a data mart is likely to be measured in weeks rather than months (or) years.

10. What are dependent and independent data marts?
Dependent data marts are sourced directly from enterprise data warehouses. Independent data marts are data captured from one (or) more operational systems (or) external information providers, (or) data generated locally within a particular department (or) geographic area.

11. Define indexing.
Indexing is a technique which is used for efficient data retrieval, (or) accessing data in a faster manner. When a table grows in volume, the indexes also increase in size, requiring more storage.

12. What are the types of indexing?
B-tree indexing
Bitmap indexing
Join indexing

13. Define metadata.
Metadata in a data warehouse is used for describing data about data; i.e., metadata are the data that define warehouse objects. Metadata are created for the
data names and definitions of the given warehouse.

14. Define VLDB.
Very Large Data Base. If a database's size is greater than 100 GB, then the database is said to be a very large database.

15. What is data cleaning?
Data cleaning routines remove incomplete, noisy and inconsistent data by
filling in missing values,
smoothing out noise,
identifying outliers, and
correcting inconsistencies in the data.

16. Mention the categories of data that may be encountered in mining.
The data used in the analysis by the data mining techniques may fall under the following categories:
Incomplete data – lacking attribute values or certain attributes of interest.
Noisy data – data containing errors or outlier values that deviate from the expected. Noise is defined as a random error or variance in a measured variable.
Inconsistent data – there may be inconsistencies in data recorded in some transactions, inconsistencies due to data integration (where a given attribute may have different names in different databases), or inconsistency due to data redundancy.

17. What are the various data smoothing techniques to remove noise?
The various data smoothing techniques are
Binning
Clustering
Combined computer and human inspection
Regression

18. What is Binning?
Binning is used to smooth data values by consulting their neighborhood values. The sorted values are distributed into a number of "buckets" or "bins". The data are first sorted and then partitioned into equidepth bins. There are three types of binning:
Smoothing by bin means – each value is replaced by the mean value of the bin.
Smoothing by bin medians – each bin value is replaced by the bin median.
Smoothing by bin boundaries – the maximum and minimum values in the bin are identified as the bin boundaries, and each value in the bin is replaced by the closest boundary value.

19. What is data integration? What are the issues to be considered while integrating data?
Data integration combines data from multiple sources into a coherent data store. Issues to be considered are
a) Entity identification problem
b) Correlation analysis
c) Detection and resolution of data value conflicts

20. What is data transformation? What are the various methods of transforming data?
Data transformation transforms and consolidates data into forms appropriate for mining. The following are various methods of transforming data:
i. Smoothing
ii. Aggregation
iii. Generalization
iv. Normalization
v. Attribute construction
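The three binning methods described above can be sketched as follows, using equidepth bins over a small sorted price list (an assumed toy data set):

```python
# Sketch of smoothing by binning: sort the data, partition it into
# equidepth (equal-size) bins, then replace values by the bin mean,
# bin median, or the closest bin boundary.

def make_bins(values, bin_size):
    data = sorted(values)
    return [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

def smooth_by_means(bins):
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    return [[sorted(b)[len(b) // 2]] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        # replace each value with whichever boundary is closer
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = make_bins(prices, 3)        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```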
UNIT III

1. Define the concept of classification.
Classification is a two-step process:
A model is built describing a predefined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes.
The model is used for classification.

2. What is a Decision tree?
A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The topmost node in a tree is the root node.

3. What is tree pruning?
Tree pruning attempts to identify and remove branches that reflect noise or outliers in the training data, with the goal of improving classification accuracy on unseen data.

4. What is an Attribute Selection Measure?
The information gain measure is used to select the test attribute at each node in the decision tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split.

5. Describe tree pruning methods.
When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data.
Approaches:
Prepruning
Postpruning

6. Define Prepruning.
A tree is pruned by halting its construction early. Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset samples.

7. Define Postpruning.
Postpruning removes branches from a "fully grown" tree. A tree node is pruned by removing its branches.
Eg: Cost complexity algorithm

8. Define information gain.
The information gain measure is used to select the test attribute at each node in the tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split. The attribute with the highest information gain is chosen as the test attribute for the current node.

9. How does tree pruning work?
There are two approaches to tree pruning:
a. In the prepruning approach, a tree is pruned by halting its construction early, e.g. by deciding not to further split the training samples at a given node. Upon halting, the node becomes a leaf node.
b. In the postpruning approach, branches are removed from a fully grown tree. The lowest pruned node becomes a leaf
and is labeled by the most frequent class.

10. How are classification rules extracted from a decision tree?
The knowledge represented in a decision tree can be extracted and represented in the form of classification IF-THEN rules. One rule is created for each path from the root to a leaf node.
E.g. IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"

11. What are Bayesian classifiers?
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class.

12. What is class conditional independence?
Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence.

13. What are the two components of a belief network?
The two components of a belief network are
1) A directed acyclic graph, where each node represents a random variable and each arc represents a probabilistic dependence.
2) A conditional probability table (CPT) for each variable.

14. Explain ID3.
ID3 is an algorithm used to build decision trees. The following steps are followed to build a decision tree:
a. Choose the splitting attribute with the highest information gain.
b. The split should reduce the amount of information needed by a large amount.

15. What is the difference between "supervised" and "unsupervised" learning schemes?
In data mining, during classification the class label of each training sample is provided; this type of training is called supervised learning, i.e. the learning of the model is supervised in that it is told to which class each training sample belongs. Eg: Classification.
In unsupervised learning, the class label of each training sample is not known, and the number or set of classes to be learned may not be known in advance. Eg: Clustering.

UNIT IV

1. Define the concept of prediction.
Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled sample, or to assess the value or value ranges of an attribute that a given sample is likely to have.

2. Define Clustering.
Clustering is a process of grouping physical or conceptual data objects into clusters.

3. What do you mean by Cluster Analysis?
Cluster analysis is the process of analyzing the various clusters to organize the different objects into meaningful and descriptive groups.
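The information gain measure used by ID3 (Unit III, questions 8 and 14) can be computed directly from entropies: the entropy of the class labels minus the weighted entropy after splitting on an attribute. The four-row data set below is a made-up fragment in the style of the buys_computer example:

```python
# Sketch of the information gain attribute selection measure.
from math import log2

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def info_gain(rows, attr, class_attr):
    labels = [r[class_attr] for r in rows]
    gain = entropy(labels)
    # subtract the weighted entropy of each partition induced by attr
    for value in set(r[attr] for r in rows):
        subset = [r[class_attr] for r in rows if r[attr] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

rows = [
    {"student": "yes", "buys": "yes"},
    {"student": "yes", "buys": "yes"},
    {"student": "no",  "buys": "no"},
    {"student": "no",  "buys": "yes"},
]
print(round(info_gain(rows, "student", "buys"), 3))  # 0.311
```

ID3 would compute this gain for every candidate attribute and split on the one with the highest value.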
4. What are the fields in which clustering techniques are used?
Clustering is used in biology to develop new plant and animal taxonomies.
Clustering is used in business to enable marketers to develop new distinct groups of their customers and characterize the customer groups on the basis of purchasing.
Clustering is used in the identification of groups of automobile insurance policy customers.
Clustering is used in the identification of groups of houses in a city on the basis of house type, cost and geographical location.
Clustering is used to classify documents on the web for information discovery.

5. What are the requirements of cluster analysis?
The basic requirements of cluster analysis are
Dealing with different types of attributes
Dealing with noisy data
Constraints on clustering
Dealing with arbitrary shapes
High dimensionality
Ordering of input data
Interpretability and usability
Determining input parameters
Scalability

6. What are the different types of data used for cluster analysis?
The different types of data used for cluster analysis are interval-scaled, binary, nominal, ordinal and ratio-scaled data.

7. What are interval-scaled variables?
Interval-scaled variables are continuous measurements on a linear scale, for example height and weight, weather temperature, or coordinates for any cluster. These measurements can be compared using Euclidean distance or Minkowski distance.

8. Define Binary variables. What are the two types of binary variables?
Binary variables have two states, 0 and 1: when the state is 0 the variable is absent, and when the state is 1 the variable is present. There are two types of binary variables, symmetric and asymmetric. Symmetric binary variables are those that have the same state values and weights; asymmetric binary variables are those that do not have the same state values and weights.

9. Define nominal, ordinal and ratio-scaled variables.
A nominal variable is a generalization of the binary variable; it has more than two states. For example, a nominal variable color might consist of four states: red, green, yellow, or black. For a nominal variable the total number of states is N, and the states are denoted by letters, symbols or integers.
An ordinal variable also has more than two states, but all these states are ordered in a meaningful sequence.
A ratio-scaled variable makes positive measurements on a non-linear scale, such as an exponential scale, using the formula Ae^(Bt) or Ae^(-Bt), where A and B are constants.

10. What do you mean by partitioning method?
In the partitioning method, a partitioning algorithm arranges all the objects into
various partitions, where the total number of partitions is less than the total number of objects. Each partition represents a cluster. The two types of partitioning method are k-means and k-medoids.

11. Define CLARA and CLARANS.
Clustering in LARge Applications is called CLARA. The efficiency of CLARA depends upon the size of the representative data set. CLARA does not work properly if any representative data set from the selected representative data sets does not find the best k-medoids. To overcome this drawback a new algorithm, Clustering Large Applications based upon RANdomized search (CLARANS), was introduced. CLARANS works like CLARA; the only difference between CLARA and CLARANS is the clustering process that is done after selecting the representative data sets.

12. What is the Hierarchical method?
The hierarchical method groups all the objects into a tree of clusters that are arranged in a hierarchical order. This method works on bottom-up or top-down approaches.

13. Differentiate Agglomerative and Divisive Hierarchical Clustering.
The agglomerative hierarchical clustering method works on the bottom-up approach. In the agglomerative hierarchical method, each object creates its own cluster. The single clusters are merged to make larger clusters, and the process of merging continues until all the singular clusters are merged into one big cluster that consists of all the objects.
The divisive hierarchical clustering method works on the top-down approach. In this method, all the objects are arranged within one big singular cluster, and the large cluster is continuously divided into smaller clusters until each cluster has a single object.

14. What is CURE?
Clustering Using REpresentatives is called CURE. Clustering algorithms generally work on spherical and similar-size clusters. CURE overcomes the problem of spherical and similar-size clusters and is more robust with respect to outliers.

15. Define the Chameleon method.
Chameleon is another hierarchical clustering method that uses dynamic modeling. Chameleon was introduced to recover the drawbacks of the CURE method. In this method, two clusters are merged if the interconnectivity between the two clusters is greater than the interconnectivity between the objects within a cluster.

16. Define Association rule mining.
Association rule mining searches for interesting relationships among items in a given data set. Rule support and confidence are the two measures of rule interestingness.

17. What is the occurrence frequency of an itemset?
The occurrence frequency of an itemset is the number of transactions that contain the itemset. It is also known as the frequency, support count or count of the itemset.

18. What are the two steps in mining association rules?
Association rule mining is a two-step process:
1. Find all frequent itemsets.
2. Generate strong association rules from the frequent itemsets.
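The k-means partitioning method mentioned in question 10 can be sketched for one-dimensional data as follows; the initial means and the fixed iteration count are arbitrary illustrative choices:

```python
# Bare-bones k-means sketch: assign each object to its nearest mean,
# recompute the means from the assignments, and repeat.

def kmeans_1d(points, means, iterations=10):
    clusters = [[] for _ in means]
    for _ in range(iterations):
        # assignment step: each point joins the cluster of its nearest mean
        clusters = [[] for _ in means]
        for p in points:
            idx = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[idx].append(p)
        # update step: recompute each mean from its cluster members
        means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
    return means, clusters

points = [1.0, 2.0, 1.5, 10.0, 11.0, 12.0]
means, clusters = kmeans_1d(points, means=[1.0, 10.0])
print(means)  # [1.5, 11.0]
```

k-medoids follows the same assign/update loop but restricts each cluster representative to an actual object, which makes it less sensitive to outliers.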
19. How are association rules classified?
Association rules are classified as follows:
Based on the types of values handled in the rule
Based on the dimensions of data involved in the rule
Based on the levels of abstraction involved in the rule
Based on the various extensions to association mining

20. What is a quantitative association rule?
If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. In these rules, the quantitative values for items or attributes are partitioned into intervals.

21. What is a Boolean association rule?
If a rule concerns the association between the presence or absence of an item, it is a Boolean association rule.

22. What are single-dimensional and multi-dimensional association rules?
If the items or attributes in an association rule reference only one dimension of a data cube, it is called a single-dimensional association rule.
If the items or attributes in an association rule reference more than one dimension of a data cube, it is called a multi-dimensional association rule.

23. What is a multilevel association rule?
If an association rule refers to a dimension at multiple levels of abstraction, it is called a multilevel association rule. If an association rule does not refer to a dimension at multiple levels of abstraction, it is called a single-level association rule.

24. Define the Apriori algorithm.
Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules. The algorithm uses prior knowledge of frequent itemset properties. Apriori employs an iterative approach known as level-wise search, where k-itemsets are used to explore (k+1)-itemsets.

25. What is a cuboid?
Data cubes created for varying levels of abstraction are referred to as cuboids. A data cube consists of a lattice of cuboids. Each higher level of abstraction reduces the data size.

26. When can we say the association rules are interesting?
Association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Users or domain experts can set such thresholds.

27. Explain association rules in mathematical notation.
Let I = {i1, i2, ....., im} be a set of items. Let D, the task-relevant data, be a set of database transactions, where each transaction T is a set of items. An association rule is an implication of the form A => B, where A ⊂ I, B ⊂ I,
and A ∩ B = ∅. The rule A => B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B. The rule A => B has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B.

28. Define support and confidence in association rule mining.
Support s is the percentage of transactions in D that contain A ∪ B.
Confidence c is the percentage of transactions in D containing A that also contain B.
Support (A => B) = P(A ∪ B)
Confidence (A => B) = P(B|A)
Support: support is the ratio of the number of transactions that include all items in the antecedent and consequent parts of the rule to the total number of transactions. Support is an association rule interestingness measure.
Confidence: confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent to the number of transactions that include all items in the antecedent. Confidence is an association rule interestingness measure.

29. How are association rules mined from large databases?
Step I: Find all frequent itemsets.
Step II: Generate strong association rules from the frequent itemsets.

30. Describe the different classifications of association rule mining.
Based on the types of values handled in the rule:
i. Boolean association rule
ii. Quantitative association rule
Based on the dimensions of data involved:
i. Single-dimensional association rule
ii. Multidimensional association rule
Based on the levels of abstraction involved:
i. Multilevel association rule
ii. Single-level association rule
Based on various extensions:
i. Correlation analysis
ii. Mining max patterns

31. What are the two main steps in the Apriori algorithm?
1) The join step
2) The prune step

32. What is the purpose of the Apriori algorithm?
The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.

33. Define the anti-monotone property.
If a set cannot pass a test, all of its supersets will fail the same test as well.

34. How to generate association rules from frequent itemsets?
Association rules can be generated as follows:
For each frequent itemset l, generate all nonempty subsets of l.
For every nonempty subset s of l, output the rule "s => (l - s)" if
support_count(l) / support_count(s) >= min_conf,
where min_conf is the minimum confidence threshold.
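The support and confidence formulas in questions 27 and 28 translate directly into code. The transactions below are a made-up toy example:

```python
# Support and confidence of a candidate rule A => B.

def support(transactions, itemset):
    # fraction of transactions containing every item in the itemset
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    # conf(A => B) = support(A u B) / support(A)
    return support(transactions, set(a) | set(b)) / support(transactions, a)

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(transactions, {"bread", "milk"}))       # 0.5
print(confidence(transactions, {"bread"}, {"milk"}))  # 2/3 of bread baskets also have milk
```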
14

35. Give few techniques to improve the efficiency of Apriori algorithm.
Hash-based technique
Transaction reduction
Partitioning
Sampling
Dynamic itemset counting

36. What are the things that affect the performance of the Apriori candidate generation technique?
Need to generate a huge number of candidate sets
Need to repeatedly scan the database and check a large set of candidates by pattern matching

37. Describe the method of generating frequent item sets without candidate generation.
Frequent-pattern growth (or FP-Growth) adopts a divide-and-conquer strategy.
Steps:
-> Compress the database representing frequent items into a frequent-pattern tree, or FP-tree
-> Divide the compressed database into a set of conditional databases
-> Mine each conditional database separately

38. Define Iceberg query.
It computes an aggregate function over an attribute or set of attributes in order to find aggregate values above some specified threshold.
Given a relation R with attributes a1, a2, ....., an and b, and an aggregate function agg_f, an iceberg query is of the form:
Select R.a1, R.a2, ....., R.an, agg_f(R.b)
From relation R
Group by R.a1, R.a2, ....., R.an
Having agg_f(R.b) >= threshold

39. What is hybrid-dimension association rules?
Multidimensional association rules with repeated predicates, which contain multiple occurrences of some predicates, are called hybrid-dimension association rules.
E.g. age (X, "20…29") ∧ buys (X, "laptop") ⇒ buys (X, "b/w printer")

40. Mention few approaches to mining Multilevel Association Rules
Uniform minimum support for all levels (or uniform support)
Using reduced minimum support at lower levels (or reduced support)
Level-by-level independent
Level-cross filtering by single item
Level-cross filtering by k-itemset

41. What are multidimensional association rules?
Association rules that involve two or more dimensions or predicates.
Interdimension association rule: Multidimensional association rule with no repeated predicate or dimension.
Hybrid-dimension association rule: Multidimensional association rule with multiple occurrences of some predicates or dimensions.

42. Define constraint-Based Association Mining.
Mining is performed under the guidance of various kinds of constraints provided by the user. The constraints include the following:
Knowledge type constraints
Data constraints
Dimension/level constraints
Interestingness constraints
Rule constraints
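The iceberg query of question 38 is a GROUP BY with a HAVING filter; a minimal Python sketch over made-up sales rows (region, amount), assuming sum as the aggregate:

```python
# Iceberg query sketch (question 38): group rows by an attribute, keep only
# groups whose aggregate value reaches the threshold. Data is illustrative.
from collections import defaultdict

rows = [("north", 120), ("south", 40), ("north", 90), ("east", 30), ("south", 25)]

def iceberg(rows, agg=sum, threshold=100):
    """SELECT key, agg(val) FROM rows GROUP BY key HAVING agg(val) >= threshold."""
    groups = defaultdict(list)
    for key, val in rows:
        groups[key].append(val)
    return {k: agg(vs) for k, vs in groups.items() if agg(vs) >= threshold}

result = iceberg(rows, agg=sum, threshold=100)
```

Only the "north" group (total 210) rises above the threshold here; the other groups are pruned.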
43. What is strong association rule?
Association rules that satisfy both a user-specified minimum confidence threshold and a user-specified minimum support threshold are referred to as strong association rules.

44. What are the various factors used to determine the interestingness measure?
1) Simplicity – the pattern should be simple overall for human comprehension
2) Certainty – this is the validity or trustworthiness of the pattern
3) Utility – this is the potential usefulness of the pattern
4) Novelty – novel patterns provide new information or increase the performance of the pattern

45. Explain the various OLAP operations.
a) Roll-up: The roll-up operation performs aggregation on a data cube, by climbing up a concept hierarchy for a dimension.
b) Drill-down: It is the reverse of roll-up. It navigates from less detailed data to more detailed data.
c) Slice: Performs a selection on one dimension of the given cube, resulting in a subcube.

46. Discuss the concepts of frequent itemset, support & confidence.
A set of items is referred to as an itemset. An itemset that contains k items is called a k-itemset. An itemset that satisfies minimum support is referred to as a frequent itemset.
Support is the ratio of the number of transactions that include all items in the antecedent and consequent parts of the rule to the total number of transactions.
Confidence is the ratio of the number of transactions that include all items in the consequent as well as the antecedent to the number of transactions that include all items in the antecedent.

47. What is the use of Regression?
Regression can be used to solve classification problems, but it can also be used for applications such as forecasting. Regression can be performed using many different types of techniques; actually, regression takes a set of data and fits the data to a formula.

48. What are the reasons for not using the linear regression model to estimate the output data?
There are many reasons for that. One is that the data do not fit a linear model. It is possible, however, that the data generally do represent a linear model, but the linear model generated is poor because noise or outliers exist in the data. Noise is erroneous data, and outliers are data values that are exceptions to the usual and expected data.

49. What are the two approaches used by regression to perform classification?
Regression can be used to perform classification using the following approaches:
1. Division: The data are divided into regions based on class.
2. Prediction: Formulas are generated to predict the output class value.
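Since regression fits data to a formula (question 47), the simplest case is a straight line fit by least squares; a minimal sketch with made-up (x, y) points:

```python
# Least-squares fit of the bivariate model Y = a + b*X.
# The (x, y) points are made-up illustrative data.
def fit_line(xs, ys):
    """Return (a, b) minimizing the squared error of y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])   # data lies exactly on y = 1 + 2x
```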
50. What is linear regression?
In linear regression data are modeled using a straight line. Linear regression is the simplest form of regression. Bivariate linear regression models a random variable Y, called the response variable, as a linear function of another random variable X, called the predictor variable:
Y = a + bX

51. What is classification?
A bank loan officer wants to analyze which loan applicants are "safe" and which are "risky" for the bank. A marketing manager needs data analysis to help guess whether a customer with a given profile will buy a new computer. In the above examples, the data analysis task is classification, where a model or classifier is constructed to predict categorical labels such as "safe" or "risky" for loan application data.

52. What is prediction?
Suppose the marketing manager would like to predict how much a given customer will spend during a sale. This data analysis task is an example of numeric prediction. The term prediction is used to refer to numeric prediction.

53. How do classifications work? (Or) Explain the steps involved in data classification.
Data classification is a two-step process:
Step 1: A classifier is built describing a predetermined set of data classes or concepts. This is the learning step (training phase), where a classification algorithm builds the classifier by analyzing or "learning from" a training set made up of database tuples and their associated class labels.
Step 2: The model is used for classification. A test set is used, made up of test tuples and their associated class labels. Test data are used to estimate the accuracy of the classification rules.
Fig: Learning – training tuples (name, age, income, loan_decision) are analyzed by a classification algorithm, which produces classification rules such as "IF age = young THEN loan_decision = risky". Classification – the rules are applied to test data and then to new data (e.g. John Henry, middle aged, low income → loan_decision = risky).

54. What is supervised learning?
The class label of each training tuple is provided in supervised learning (i.e. the learning of the classifier is "supervised" in that it is told to which class each training tuple belongs).
Eg: Learning – Training data are analyzed by a classification algorithm.
Training data:
Name | Age | Income | loan_decision
Sandy Jones | young | low | risky
Caroline | middle aged | high | safe
Susan Lake | senior | low | safe

In the above table the class label attribute is loan_decision, and the learned model or classifier is represented in the form of classification rules:
Eg. IF age = young THEN loan_decision = risky
IF income = high THEN loan_decision = safe
IF age = middle-aged AND income = low THEN loan_decision = risky

55. What is unsupervised learning?
In unsupervised learning (or clustering), the class label of each training tuple is not known, and the number or set of classes to be learned may not be known in advance.
For eg: if we did not have the loan_decision data available for the training set, we could use clustering to try to determine "groups of like tuples", which may correspond to risk groups within the loan application data.

56. What are the preprocessing steps of the classification and prediction process?
The following preprocessing steps may be applied to the data to help improve the accuracy, efficiency and scalability of the classification or prediction process.
Data cleaning – this refers to preprocessing of data in order to remove or reduce noise (by applying smoothing techniques) and the treatment of missing values.
Relevance analysis – many of the attributes in the data may be redundant. A strong correlation between attributes A1 and A2 would suggest that one of the two could be removed from further analysis. Attribute subset selection can be used to find a reduced set of attributes. Relevance analysis in the form of correlation analysis and attribute subset selection can be used to delete attributes that do not contribute to the classification or prediction task.
Data transformation and reduction – the data may be transformed by normalization. Data can also be transformed by generalizing it to higher-level concepts.
Eg: the attribute income can be generalized to discrete ranges such as low, medium and high.

57. What are the criteria used in comparing classification and prediction methods?
Accuracy – the accuracy of the classifier refers to the ability of the classifier to correctly predict the class label of new or previously unseen data (i.e. tuples without class label information). The accuracy of the predictor refers to how well the given predictor can guess the value of the predicted attribute for new or unseen data.
Speed – this refers to the computational cost involved in generating and using the given classifier or predictor.
Robustness – the ability to make correct predictions from given noisy data or data with missing values.
Scalability – the ability to construct the classifier or predictor efficiently given large amounts of data.
Interpretability – this refers to the level of understanding and insight that is provided by the classifier or predictor.
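The loan_decision rules from the example above can be applied to the three training tuples and scored for accuracy (question 57); a minimal sketch, where the fallback class for uncovered tuples is an assumption:

```python
# Applying the example IF-THEN rules and measuring classifier accuracy.
# Data comes from the example table; the default class is an assumption.
def classify(age, income):
    """The learned rules, checked in order; first match wins."""
    if age == "young":
        return "risky"
    if income == "high":
        return "safe"
    if age == "middle-aged" and income == "low":
        return "risky"
    return "safe"   # assumed default class for tuples no rule covers

training = [
    ("Sandy Jones", "young", "low", "risky"),
    ("Caroline", "middle-aged", "high", "safe"),
    ("Susan Lake", "senior", "low", "safe"),
]
# Accuracy = fraction of tuples whose predicted label matches the true label.
accuracy = sum(classify(a, i) == label for _, a, i, label in training) / len(training)
```

All three tuples are classified correctly, so accuracy is 1.0 on this tiny set.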
58. What are Bayesian classifiers?
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes' theorem. Bayesian classifiers have exhibited high accuracy and speed when applied to large databases.

59. Define Bayes' theorem.
Let X be a data tuple. In Bayesian terms, X is considered "evidence". Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. P(H|X) is the posterior probability of H conditioned on X.
Suppose X is a 35-yr-old customer with an income of $40,000 and H is the hypothesis that X will buy a computer. Then P(H|X) is the probability that X will buy a computer given that we know the customer's age and income. In contrast, P(H) is the prior probability of H. This is the probability that any given customer will buy a computer, regardless of age, income, or any other information. P(X|H) is the posterior probability of X conditioned on H, i.e., it is the probability that a customer, X, is 35 yrs old and earns $40,000, given that we know the customer will buy a computer. P(X) is the prior probability of X. It is the probability that a person from our set of customers is 35 yrs old and earns $40,000.
How are the probabilities estimated? P(H), P(X|H), and P(X) may be estimated from the given data. Bayes' theorem is useful in that it provides a way of calculating the posterior probability P(H|X) from P(H), P(X|H), and P(X).
Bayes' theorem is
P(H|X) = [ P(X|H) P(H) ] / P(X)

60. What are Bayesian belief networks? Give an example.
Bayesian belief networks specify joint conditional probability distributions. They provide a graphical model of causal relationships, on which learning can be performed. Bayesian belief networks can be used for classification.
A belief network is defined by two components – a directed acyclic graph (DAG) and a set of conditional probability tables. Each node in the DAG represents a random variable. The variables may correspond to actual attributes given in the data or to "hidden variables" believed to form a relationship (i.e. in the case of medical data, a hidden variable may indicate a syndrome, representing a number of symptoms that together characterize a specific disease).
Each arc represents a probabilistic dependence. If an arc is drawn from node Y to node Z, then Y is the parent or immediate predecessor of Z, and Z is the descendant of Y.
Fig (a): A simple Bayesian belief network – FamilyHistory and Smoker are parents of LungCancer and Emphysema; LungCancer is a parent of PositiveXRay and Dyspnea.
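Bayes' theorem from question 59 is a one-line computation; the three input probabilities below are made-up illustrative values, not estimates from any data set:

```python
# Computing the posterior P(H|X) = P(X|H) * P(H) / P(X) (question 59).
# The input probabilities are made-up illustrative values.
def posterior(p_x_given_h, p_h, p_x):
    return p_x_given_h * p_h / p_x

p = posterior(p_x_given_h=0.2, p_h=0.5, p_x=0.25)   # 0.2 * 0.5 / 0.25 = 0.4
```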
CPT:
     FH,S   FH,~S   ~FH,S   ~FH,~S
LC   0.8    0.5     0.7     0.1
~LC  0.2    0.5     0.3     0.9

(b) The conditional probability table for the values of the variable LungCancer (LC), showing each possible combination of the values of its parents, FamilyHistory (FH) and Smoker (S).
A belief network has one conditional probability table (CPT) for each variable. The CPT for a variable Y specifies the conditional distribution P(Y | Parents(Y)), where Parents(Y) are the parents of Y.

61. What is rule based classification?
Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form
IF condition THEN conclusion
Eg:
R1: IF age = youth AND student = yes THEN buys_computer = yes
Explanation: The "IF" part (or left-hand side) of a rule is known as the rule antecedent or precondition. The "THEN" part (or right-hand side) is the rule consequent.
The rule R1 can also be written as
R1: (age = youth) ∧ (student = yes) => (buys_computer = yes)

62. What is sequential covering algorithm? How is it different from decision tree induction?
IF-THEN rules can be extracted directly from the training data (i.e. without having to generate a decision tree first) using a sequential covering algorithm. Sequential covering algorithms are the most widely used approach to mining disjunctive sets of classification rules. Popular sequential covering algorithms are AQ, CN2, and the most recent, RIPPER. The general strategy is as follows. Rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed, and the process repeats on the remaining tuples. The sequential learning of rules is in contrast to decision tree induction, where the path to each leaf in a decision tree corresponds to a rule.

63. What is back propagation?
Back propagation is a neural network learning algorithm. A neural network is a set of connected input/output units in which each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples. Back propagation learns by iteratively processing a data set of training tuples, comparing the network's prediction for each tuple with the actual target value.

64. What is associative classification?
In associative classification, association rules are generated and analyzed for use in classification. The general idea is that we can search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels. Decision tree induction considers only one attribute at a time, whereas association rules explore highly confident associations among multiple attributes.
Various associative classification methods are –
CBA (classification based on association) – CBA uses an iterative approach to frequent itemset mining.
CMAR (classification based on multiple association rules) – it differs from CBA in its strategy for frequent itemset mining and its construction of the classifier.
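The CPT shown earlier for LungCancer can be stored as a dictionary keyed by the parent values; a minimal sketch (the boolean encoding of FH and S is an assumption):

```python
# The LungCancer CPT as a dictionary keyed by (FamilyHistory, Smoker).
# Values are P(LC = yes | parents); each column of the table sums to 1.
cpt_lc = {
    (True, True): 0.8,    # FH, S
    (True, False): 0.5,   # FH, ~S
    (False, True): 0.7,   # ~FH, S
    (False, False): 0.1,  # ~FH, ~S
}

def p_lung_cancer(lc, family_history, smoker):
    """P(LC = lc | FamilyHistory, Smoker) looked up from the CPT."""
    p_yes = cpt_lc[(family_history, smoker)]
    return p_yes if lc else 1.0 - p_yes
```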
65. What are k-Nearest Neighbour classifiers?
Nearest-neighbour classifiers are based on learning by analogy, i.e. by comparing a given tuple with training tuples that are similar to it. The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbour classifier searches the pattern space for the k training tuples that are closest to the unknown tuple.

66. What is regression analysis?
Regression analysis can be used to model the relationship between one or more independent or predictor variables and a dependent or response variable (which is continuous-valued). The predictor variables are attributes of interest describing the tuple (i.e. making up the attribute vector). In general, the values of the predictor variables are known. The response variable is what we want to predict. Given a tuple described by predictor variables, we want to predict the associated value of the response variable. Many problems can be solved by linear regression. Several packages exist to solve regression problems. Examples include SAS, SPSS and S-Plus.

67. What is non-linear regression?
If a given response variable and predictor variable have a relationship that may be modeled by a polynomial function, it is called non-linear regression or polynomial regression. It can be modeled by adding polynomial terms to the basic linear model.

68. Explain clustering by k-means partitioning.
The k-means algorithm takes the input parameter k, and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. Cluster similarity is measured in regard to the mean value of the objects in a cluster (the cluster centroid or center of gravity).
How does the k-means algorithm work?
The k-means algorithm proceeds as follows: First it randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. It then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the square-error criterion is used, defined as
E = Σ_{i=1}^{k} Σ_{p∈Ci} |p − mi|²
where E = sum of the squared errors for all objects in the data set,
p = point in space representing a given object, and
mi = mean of cluster Ci.

69. What is k-Medoids method?
The k-means algorithm is sensitive to outliers because an object with an extremely large value may substantially distort the distribution of data. Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster. Each remaining object is clustered with the representative object to which it is the most similar. The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. The absolute-error criterion is defined as
E = Σ_{j=1}^{k} Σ_{p∈Cj} |p − oj|
where E = sum of the absolute errors for all objects in the data set,
p = point in space representing a given object in cluster Cj, and
oj is the representative object of Cj.
The algorithm iterates until, eventually, each representative object is actually the medoid, or most centrally located object, of its cluster. This is the basis of the k-medoids method for grouping n objects into k clusters.
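The k-means iteration described above (question 68) can be sketched in a few lines; one-dimensional points are used for brevity, and the data, k, and first-k initialization are illustrative assumptions:

```python
# Minimal 1-D k-means sketch: assign each point to the nearest mean,
# then recompute the means, and repeat. Data and k are made up.
def kmeans(points, k, iters=20):
    means = points[:k]                      # initial centers: the first k objects
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                    # assignment step
            i = min(range(k), key=lambda i: abs(p - means[i]))
            clusters[i].append(p)
        means = [sum(c) / len(c) if c else means[i]   # update step
                 for i, c in enumerate(clusters)]
    return means, clusters

means, clusters = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], k=2)
```

On this toy data the means converge to the two obvious group centers, 1.0 and 9.0.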
The potential subject areas in which data
70. What is outlier detection and analysis?
One person's noise could be another person's signal. Outliers are data objects that do not comply with the general behavior or the model of the data. Outliers can be caused by measurement or execution error. Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. However, outliers may be of particular interest, such as in the case of fraud detection.

71. What is outlier mining?
Outlier detection and analysis is an interesting data mining task referred to as outlier mining. Outlier mining has wide applications. It can be used in fraud detection (detecting unusual usage of credit cards etc). Outlier mining can be described as follows: Given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data.

72. Explain in brief different outlier detection approaches.
Computer-based methods for outlier detection fall into four approaches:
The statistical approach
The distance-based approach
The density-based local outlier approach
The deviation-based approach

UNIT – V

1. What are the areas in which data warehouses are used in present and in future?
The potential subject areas in which data warehouses may be developed at present and also in the future are:
(i). Census data:
The registrar general and census commissioner of India decennially compiles information of all individuals, villages, population groups, etc. This information is wide ranging, such as the individual slip, a compilation of information of individual households, of which a database of a 5% sample is maintained for analysis. A data warehouse can be built from this database upon which OLAP techniques can be applied; data mining also can be performed for analysis and knowledge discovery.
(ii). Prices of Essential Commodities:
The ministry of food and civil supplies, Government of India compiles daily data for about 300 observation centers in the entire country on the prices of essential commodities such as rice, edible oil etc. A data warehouse can be built for this data and OLAP techniques can be applied for its analysis.
2. What are the other areas for Data warehousing and data mining?
Agriculture
Rural development
Health
Planning
Education
Commerce and Trade

3. Specify some of the sectors in which data warehousing and data mining are used?
Tourism
Program Implementation
Revenue
Economic Affairs
Audit and Accounts

4. Describe the use of DBMiner.
Used to perform data mining functions, including characterization, association, classification, prediction and clustering.

5. Applications of DBMiner.
The DBMiner system can be used as a general-purpose online analytical mining system for both OLAP and data mining in relational databases and data warehouses. Used in medium to large relational databases with fast response time.

6. Give some data mining tools.
DBMiner
GeoMiner
Multimedia miner
WeblogMiner

7. Mention some of the application areas of data mining
DNA analysis
Financial data analysis
Retail industry
Telecommunication industry
Market analysis
Banking industry
Health care analysis

8. Differentiate data query and knowledge query
A data query finds concrete data stored in a database and corresponds to a basic retrieval statement in a database system.
A knowledge query finds rules, patterns and other kinds of knowledge in a database and corresponds to querying database knowledge, including deduction rules, integrity constraints, generalized rules, frequent patterns and other regularities.

9. Differentiate direct query answering and intelligent query answering.
Direct query answering means that a query answers by returning exactly what is being asked.
Intelligent query answering consists of analyzing the intent of the query and providing generalized, neighborhood, or associated information relevant to the query.

10. Define visual data mining
Discovers implicit and useful knowledge from large data sets using data and/or knowledge visualization techniques. It is an integration of data visualization and data mining.

11. What does audio data mining mean?
Uses audio signals to indicate patterns of data or the features of data mining results. Patterns are transformed into sound and music, to identify interesting or unusual patterns by listening to pitches, rhythms, tune and melody.

Steps involved in DNA analysis:
Semantic integration of heterogeneous, distributed genome databases
Similarity search and comparison among DNA sequences
Association analysis: Identification of co-occurring gene sequences
Path analysis: Linking genes to different stages of disease development
Visualization tools and genetic data analysis

12. What are the factors involved while choosing a data mining system?
Data types
System issues
Data sources
Data mining functions and methodologies
Coupling data mining with database and/or data warehouse systems
Scalability
Visualization tools
Data mining query language and graphical user interface

13. Define DMQL
Data Mining Query Language. It specifies clauses and syntaxes for performing different types of data mining tasks, for example data classification, data clustering and mining association rules. Also it uses SQL-like syntax to mine databases.

14. Define text mining
Extraction of meaningful information from large amounts of free-format textual data. Useful in artificial intelligence and pattern matching. Also known as text data mining, knowledge discovery from text, or content analysis.

15. What does web mining mean
A technique to process information available on the web and search for useful data. It is used to discover web pages, text documents, multimedia files, images, and other types of resources from the web. Used in several fields such as E-commerce, information filtering, fraud detection, and education and research.

16. Define spatial data mining.
Extracting undiscovered and implied spatial information.
Spatial data: data that is associated with a location.
Used in several fields such as geography, geology, medical imaging etc.

17. Explain multimedia data mining.
Mines large databases. Does not retrieve any specific information from multimedia databases. Derives new relationships, trends, and patterns from stored multimedia data. Used in medical diagnosis, stock markets, the animation industry, the airline industry, traffic management systems, surveillance systems etc.

18. What is Time Series Analysis?
A time series is a set of attribute values over a period of time. Time series analysis may be viewed as finding patterns in the data and predicting future values.

19. What are the various detected patterns?
Detected patterns may include:
Trends: It may be viewed as systematic non-repetitive changes to the values over time.
Cycles: The observed behavior is cyclic.
Seasonal: The detected patterns may be based on time of year or month or day.
Outliers: To assist in pattern detection, techniques may be needed to remove or reduce the impact of outliers.
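One simple way to flag the "trend" pattern from question 19 is to fit a line to the series against its time index and inspect the slope sign; a minimal sketch with a made-up series:

```python
# Least-squares slope of a time series against t = 0, 1, 2, ...
# A clearly positive slope suggests an upward trend. Data is illustrative.
def trend_slope(series):
    n = len(series)
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
    den = sum((t - mean_t) ** 2 for t in range(n))
    return num / den

slope = trend_slope([10, 12, 13, 15, 18])   # rising values, so slope > 0
```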
20. What is a spatial database?
A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or medical imaging data, and VLSI chip layout data.

21. Define spatial data mining?
Spatial data mining refers to the extraction of knowledge, spatial relationships, or other interesting patterns not explicitly stored in the spatial database. Such mining demands an integration of data mining with spatial databases.

22. What is spatial data warehouse?
A spatial data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of both spatial and nonspatial data in support of spatial data mining and spatial-data-related decision-making processes.

23. What are the different dimensions in a spatial data cube?
There are three dimensions in a spatial data cube:
1) Nonspatial dimension
2) Spatial-to-nonspatial dimension
3) Spatial-to-spatial dimension

24. What is progressive refinement?
Progressive refinement is an optimization method for mining spatial association rules from a spatial database. This method first mines large data sets roughly using a fast algorithm and then improves the quality of mining in a pruned data set using a more expensive algorithm.

25. What is spatial classification?
Spatial classification analyzes spatial objects to derive classification schemes in relevance to certain spatial properties, such as the neighborhood of a district, highway, river etc.

26. What are the two multimedia indexing and retrieval systems?
1) Description-based retrieval systems – build indices and perform object retrieval based on image descriptions such as keywords, time of creation, size etc.
2) Content-based retrieval systems – these systems support retrieval based on the image content, such as color histogram, texture, shape etc.

27. What are the retrieval methods (based on signature) proposed for similarity-based retrieval in image databases?
1) Color histogram-based signature
2) Multifeature composed signature
3) Wavelet-based signature
4) Wavelet-based signature with region-based granularity

28. What is feature descriptor?
A feature descriptor is a set of vectors for each visual characteristic. The main vectors are the color vector, an MFO (Most Frequent Orientation) vector, and an MFC (Most Frequent Color) vector.

29. What is a time-series database?
A time-series database consists of a sequence of values or events that change with time. It is a sequence database in which the values are measured at equal intervals.

30. What is a sequence database?
A sequence database contains sequences of ordered events, with or without a concrete notion of time.
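One transform commonly applied to a time-series database (question 29) is the discrete Fourier transform; a naive stdlib-only sketch with a made-up four-point series:

```python
# Naive discrete Fourier transform: X[k] = sum_t x[t] * exp(-2*pi*i*k*t/N).
# The sample series is illustrative (a pure cosine of period 4).
import cmath

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

coeffs = dft([1.0, 0.0, -1.0, 0.0])
```

The energy concentrates in the frequency bins matching the cosine; coefficient 0 (the mean) is zero here.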
31. What are the two popular data independent transformations?
1) Discrete Fourier Transform (DFT)
2) Discrete Wavelet Transform (DWT)

32. What are similarity searches that handle gaps and differences in offsets and amplitudes?
The searches that handle gaps and amplitude are:
Atomic matching
Window stitching
Subsequence ordering

33. What are the parameters that affect the result of sequential pattern mining?
Duration
Event folding window
Interval

34. What is a serial episode and a parallel episode?
A serial episode is a set of events that occurs in a total order, whereas a parallel episode is a set of events whose occurrence ordering is trivial.

35. What is periodicity analysis? What are the problems in periodic analysis?
Periodicity analysis is the mining of periodic patterns, i.e. the search for recurring patterns in a time-series database. The following are the problems in periodic analysis:
1) Mining full periodic patterns
2) Mining partial periodic patterns
3) Mining cyclic or periodic association rules

36. What is text database?
A text database consists of a large collection of documents from various sources such as news articles, research papers, books, digital libraries, e-mail messages, and web pages. Data is stored in semi-structured form.

37. What is Information retrieval (IR)?
Information retrieval is concerned with the organization and retrieval of information from a large number of text-based documents. A typical information retrieval problem is to locate relevant documents based on user input, such as keywords or example documents.

38. What are the basic measures for assessing quality of text retrieval?
1) Precision – This is the percentage of retrieved documents that are in fact relevant to the query.
2) Recall – This is the percentage of documents that are relevant to the query and were, in fact, retrieved.

39. Write short notes on multidimensional data model?
Data warehouses and OLAP tools are based on a multidimensional data model. This model is used for the design of corporate data warehouses and department data marts. This model contains a star schema, snowflake schema and fact constellation schemas. The core of the multidimensional model is the data cube.

40. Define data cube?
It consists of a large set of facts (or) measures and a number of dimensions.

41. What are facts?
Facts are numerical measures. Facts can also be considered as quantities by which we can analyze the relationship between dimensions.
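The precision and recall measures of question 38 can be computed directly from the sets of retrieved and relevant documents; a sketch with made-up document ids:

```python
# Precision and recall for one query (question 38); ids are illustrative.
retrieved = {"d1", "d2", "d3", "d5"}
relevant = {"d2", "d3", "d4"}

hits = retrieved & relevant
precision = len(hits) / len(retrieved)   # fraction of retrieved docs that are relevant
recall = len(hits) / len(relevant)       # fraction of relevant docs that were retrieved
```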

Dimensions are the entities (or) perspectives a) A large central table (fact
with respect to an organization for table) containing data with
keeping records and are hierarchical in no redundancy
nature. b) A set of smaller attendant
tables (dimension tables),
43.Define dimension table? one for each dimension
A dimension table is used for describing the
dimension. 48.What is snowflake schema?
(e.g.) A dimension table for item may The snowflake schema is a variant of the
contain the attributes item_ name, brand and star schema model, where some
type. dimension tables are normalized thereby
further splitting the tables in to additional
44.Define fact table? tables.
Fact table contains the name of facts (or)
measures as well as keys to each of the 49.List out the components of fact
related dimensional tables. constellation schema?
This requires multiple fact tables to share
45.What are lattice of cuboids? dimension tables. This kind of schema
In data warehousing research literature, a can be viewed as a collection of stars and
cube can also be called as cuboids. For hence it is known as galaxy schema (or) fact
different (or) set of dimensions, we can constellation schema.
construct a lattice of cuboids, each showing
the 50.Point out the major difference between
data at different level. The lattice of cuboids the star schema and the snowflake
is also referred to as data cube. schema?
The dimension table of the snowflake
46.What is apex cuboid? schema model may be kept in normalized
The 0-D cuboid which holds the highest form to reduce redundancies. Such a table is
level of summarization is called the apex easy to maintain and saves storage space.
cuboid. The apex cuboid is typically denoted
by all. 51.Which is popular in the data
warehouse design, star schema model (or)
47.List out the components of star snowflake schema model?
schema? Star schema model, because the snowflake
_ A large central table (fact table) containing structure can reduce the effectiveness
the bulk of data with no and more joins will be needed to execute a
redundancy. query.
_ A set of smaller attendant tables
(dimension tables), one for each 52.Define concept hierarchy?
dimension. A concept hierarchy defines a sequence of
mappings from a set of low-level
Star schema. concepts to higher-level concepts.
Multidimensional data model
can exist in the form of star schema. It 53.Define total order?
consists of If the attributes of a dimension which forms
a concept hierarchy such as
27

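The concept hierarchy of Q52 can be sketched as explicit low-level-to-high-level mappings; a minimal example for a location dimension (the place names are invented sample data):

```python
# Location concept hierarchy: street -> city -> province_or_state -> country.
hierarchy = {
    "street_to_city": {"Anna Salai": "Chennai", "MG Road": "Bangalore"},
    "city_to_state": {"Chennai": "Tamil Nadu", "Bangalore": "Karnataka"},
    "state_to_country": {"Tamil Nadu": "India", "Karnataka": "India"},
}

def generalize(street):
    """Climb the hierarchy from a low-level street up to its country."""
    city = hierarchy["street_to_city"][street]
    state = hierarchy["city_to_state"][city]
    return hierarchy["state_to_country"][state]

country = generalize("Anna Salai")   # 'India'
```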
53. Define total order?
If the attributes of a dimension form a concept hierarchy such as “street < city < province_or_state < country”, then it is said to be a total order.
Country
Province or state
City
Street
Fig: Total order for location

54. Define partial order?
If the attributes of a dimension form a lattice such as “day < {month < quarter; week} < year”, then it is said to be a partial order.

55. Define schema hierarchy?
A concept hierarchy that is a total (or) partial order among attributes in a database schema is called a schema hierarchy.

56. List out the OLAP operations in multidimensional data model?
* Roll-up
* Drill-down
* Slice and dice
* Pivot (or) rotate

57. What is roll-up operation?
The roll-up operation, also called the drill-up operation, performs aggregation on a data cube either by climbing up a concept hierarchy for a dimension (or) by dimension reduction.

58. What is drill-down operation?
Drill-down is the reverse of the roll-up operation. It navigates from less detailed data to more detailed data. Drill-down can take place by stepping down a concept hierarchy for a dimension.

59. What is slice operation?
The slice operation performs a selection on one dimension of the cube, resulting in a sub cube.

60. What is dice operation?
The dice operation defines a sub cube by performing a selection on two (or) more dimensions.

61. What is pivot operation?
This is a visualization operation that rotates the data axes to provide an alternative presentation of the data.

62. List the applications of data warehousing
* Decision support
* Trend analysis
* Financial forecasting
* Churn prediction for Telecom subscribers, Credit Card users etc.
* Insurance fraud analysis
* Call record analysis
* Logistics and Inventory management
* Agriculture

63. Name some of the data mining applications?
Data mining for Biomedical and DNA data analysis
Data mining for Financial data analysis
Data mining for the Retail industry
Data mining for the Telecommunication industry

64. What are the contributions of data mining to DNA analysis?
Semantic integration of heterogeneous, distributed genome databases
Similarity search and comparison among DNA sequences
Association analysis: identification of co-occurring gene sequences
Path analysis: linking genes to different stages of disease development
Visualization tools and genetic data analysis
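The slice, dice and roll-up operations described in Q57–Q60 can be mimicked on a plain list of records; a minimal sketch (the data is invented):

```python
# Each record: (year, city, item, sales)
cube = [
    (2023, "Chennai", "pen",      100),
    (2023, "Mumbai",  "pen",       50),
    (2024, "Chennai", "notebook",  80),
    (2024, "Mumbai",  "pen",       70),
]

def slice_op(rows, year):
    """Slice: select on a single dimension (here, year)."""
    return [r for r in rows if r[0] == year]

def dice_op(rows, years, cities):
    """Dice: select on two or more dimensions."""
    return [r for r in rows if r[0] in years and r[1] in cities]

def roll_up(rows):
    """Roll-up: aggregate by climbing city -> country (all cities merge)."""
    return sum(r[3] for r in rows)

sliced = slice_op(cube, 2023)             # the two 2023 records
diced = dice_op(cube, {2024}, {"Mumbai"})
total = roll_up(cube)                     # 300
```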
65. Name some examples of data mining in retail industry?
Design and construction of data warehouses based on the benefits of data mining
Multidimensional analysis of sales, customers, products, time and region
Analysis of the effectiveness of sales campaigns
Customer retention: analysis of customer loyalty
Purchase recommendation and cross-reference of items

66. Name some of the data mining applications
Data mining for Biomedical and DNA data analysis
Data mining for Financial data analysis
Data mining for the Retail industry
Data mining for the Telecommunication industry

67. What are the features of object-relational and object-oriented databases?
Both kinds of systems deal with the efficient storage and access of vast amounts of disk-based, complex, structured objects. They organize a large set of data objects into classes, which are in turn organized into class/subclass hierarchies. Each object in a class is associated with
a. an object identifier,
b. a set of attributes, and
c. a set of methods that specify the computational routines or rules associated with the object class.

68. How data mining is performed on complex data types?
Vast amounts of data are stored in various complex forms. The complex data types include objects, spatial data, multimedia data, text data and web data. Multidimensional analysis and data mining can be performed by
a. class-based generalization of complex objects, including set-valued, list-valued, class-subclass hierarchies and class composition hierarchies,
b. constructing object data cubes, and
c. performing generalization-based mining.

69. Give an example of a star schema of a spatial data warehouse.
There are 3,000 weather probes distributed in British Columbia (BC), Canada, each recording daily temperature and precipitation for a designated small area and transmitting signals to a provincial weather station. With a spatial data warehouse that supports spatial OLAP, a user can view weather patterns on a map by month, by region, etc.

70. How a spatial data warehouse is constructed?
As with relational data, we can integrate spatial data to construct a data warehouse that facilitates spatial data mining. A spatial data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of both spatial and non-spatial data.
71. What are spatial association rules?
Similar to the mining of association rules in transaction and relational databases, spatial association rules can be mined in spatial databases. A spatial association rule is of the form A => B [s%, c%], where A and B are sets of spatial or non-spatial predicates, s% is the support of the rule and c% is the confidence of the rule.
Eg: is_a(X, "school") ^ close_to(X, "sports_center") => close_to(X, "park") [0.5%, 80%]
This rule states that 80% of schools that are close to a sports center are also close to a park, and 0.5% of the data belongs to such a case. Various kinds of spatial predicates can constitute a spatial association rule:
a. distance information (such as close_to and far_away),
b. topological relations (like intersect, overlap and disjoint), and
c. spatial orientations (such as left_of and right_of).

72. What are different categories of mining associations in multimedia data?
Association rules involving multimedia objects can be mined in image and video databases. The three categories in mining associations are
a. Associations between image content and non-image content features. A rule like “If at least 50% of the upper part of the picture is blue, then it is likely to represent sky” links the image content to the keyword “sky”.
b. Associations among image contents that are not related to spatial relationships. Eg: if a picture contains two blue squares, then it is likely to contain one red circle as well.
c. Associations among image contents related to spatial relationships. Eg: a rule like “If a red triangle is in between two yellow squares, then it is likely a big oval-shaped object is underneath.”
To mine associations among multimedia objects, we can treat each image as a transaction and find frequently occurring patterns among different images.

73. Explain Audio & Video Data mining.
There are great demands for effective content-based retrieval and data mining methods for audio and video data. Besides still images, an incommensurable amount of audiovisual information is available in digital archives, on the World Wide Web, in broadcast data streams and in personal and professional databases. Typical examples include searching for and multimedia editing of particular video clips in a TV studio, and detecting suspicious persons or scenes in surveillance videos.
To facilitate the recording, search and analysis of audio and video information from multimedia data, the following industry standards are available:
a. MPEG-7 (Moving Picture Experts Group)
b. JPEG (Joint Photographic Experts Group)

74. What are the different text retrieval methods?
a. Document selection methods: the query specifies constraints for selecting relevant documents. A typical method is the Boolean retrieval model, in which a document is represented by a set of keywords and a query is a Boolean expression of keywords.
Eg: “Database systems but not Oracle”
b. Document ranking methods: used to rank all documents in order of relevance. In these methods, we may match the keywords in the query with those in the documents and score each document based on how well it matches the query. The goal is to approximate the degree of relevance of a document with a score computed from information such as the frequency of words in the document and in the whole collection.
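The support s% and confidence c% of a rule A => B (as in Q71) reduce to simple counting over transactions; a minimal sketch with invented data:

```python
# Each "transaction" is the set of predicates that hold for one object.
transactions = [
    {"school", "close_to_sports_center", "close_to_park"},
    {"school", "close_to_sports_center"},
    {"school"},
    {"house"},
]

def rule_stats(rows, a, b):
    """Return (support, confidence) of the rule A => B."""
    both = sum(1 for t in rows if a <= t and b <= t)   # A and B both hold
    antecedent = sum(1 for t in rows if a <= t)        # A holds
    support = both / len(rows)
    confidence = both / antecedent if antecedent else 0.0
    return support, confidence

s, c = rule_stats(transactions,
                  {"school", "close_to_sports_center"},
                  {"close_to_park"})
# s = 1/4 = 0.25, c = 1/2 = 0.5
```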
75. How can automated document classification be performed?
There are tremendous numbers of online documents available. Automated document classification is an important text mining task, since a need exists to automatically organize documents into classes so as to facilitate document retrieval and subsequent analysis. A general procedure for automated document classification:
First, a set of pre-classified documents is taken as the training set. The training set is then analyzed in order to derive a classification scheme. Such a classification scheme often needs to be refined with a testing process. The so-derived classification scheme can then be used for the classification of other on-line documents.
A few typical classification methods used in text classification are
a. Nearest-neighbour classification
b. Feature selection methods
c. Bayesian classification

76. Explain briefly some data classification methods.
a. Nearest-neighbor classification: the k-nearest-neighbor classifier is based on the intuition that similar documents are expected to be assigned the same class label. We can simply index the training documents and associate each with a class label; the class label of a test document can then be determined from the class label distribution of its k nearest neighbors. By tuning k and incorporating refinements, this kind of classification can achieve accuracy comparable with the best classifiers.
b. Feature selection: a feature selection process can be used to remove terms in the training documents that are statistically uncorrelated with the class labels.
c. Bayesian classification: first trains the model by estimating a generative document distribution P(d|c) for each class c and document d, and then tests which class is most likely to generate the test document.

77. What are the different methods of document clustering?
Document clustering is one of the most crucial techniques for organizing documents in an unsupervised manner (class labels are not known in advance).
a. Spectral clustering method: first performs spectral embedding (dimensionality reduction) on the original data, and then applies a traditional clustering algorithm (e.g. k-means) on the reduced document space.
b. Mixture model clustering method: models the text data with a mixture model (involving multinomial component models). Clustering involves two steps:
(1) estimating the model parameters based on the text data and any additional prior knowledge, and
(2) inferring the clusters based on the estimated model parameters.
c. Latent semantic indexing (LSI): a linear dimensionality reduction method. We can acquire transformation vectors or an embedding function, through which we embed all of the data into a lower-dimensional space.
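The k-nearest-neighbor text classification of Q76 can be sketched with word overlap as the similarity measure; the documents and labels below are invented sample data:

```python
from collections import Counter

# Labeled training documents, represented as sets of words.
training = [
    ({"stock", "market", "price"},  "finance"),
    ({"share", "market", "trade"},  "finance"),
    ({"gene", "dna", "sequence"},   "biology"),
    ({"protein", "dna", "cell"},    "biology"),
]

def classify(doc, k=3):
    """Label a document by majority vote among its k most similar neighbors."""
    scored = sorted(training,
                    key=lambda pair: len(doc & pair[0]),  # shared words
                    reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

label = classify({"market", "price", "trade"})
```

Real systems would use weighted term vectors (e.g. tf-idf with cosine similarity) rather than raw word overlap; the voting scheme is the same.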
78. What is time series database?
A time-series database consists of sequences of values or events obtained over repeated measurements of time (hourly, daily, weekly). Time-series databases are popular in many applications, such as stock market analysis, economic and sales forecasting, budgetary analysis, workload projections, process and quality control, natural phenomena (such as atmospheric temperature, wind and earthquakes), scientific and engineering experiments, and medical treatments.
The amount of time-series data is increasing rapidly, often gigabytes per day (such as in stock trading) or even per minute (such as in NASA space programs). A need exists to find correlation relationships within time-series data, as well as to analyze huge numbers of regular patterns, trends, bursts (such as sudden sharp changes) and outliers, with fast or even real-time response.

79. What is trend analysis?
A time series involving a variable Y, representing, say, the closing price of a share in a stock market, can be viewed as a function of time t, that is, Y = f(t). Such a function can be illustrated as a time-series graph.
(Fig: time-series data of stock price; the dashed curve shows the trend.)
How can we study the time series data? There are two goals:
(1) Modelling time series: to gain insight into the mechanisms or underlying forces that generate the time series.
(2) Forecasting time series: to predict the future values of the time-series variables.
Trend analysis consists of the following 4 major components:
1) Trend or long-term movements: displayed by a trend curve or a trend line.
2) Cyclic movements or cyclic variations: refer to cycles, the long-term oscillations about a trend line or curve.
3) Seasonal movements or variations: systematic or calendar-related. Eg: events that recur annually, such as the sudden increase in sales of items before Christmas, or the observed increase in water consumption during summer.
4) Irregular or random movements. Eg: floods, personnel changes within companies.

80. What are the basic measures for text retrieval?
a. Precision: the percentage of retrieved documents that are relevant to the query (i.e. “correct” responses).
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
b. Recall: the percentage of documents that are relevant to the query and were in fact retrieved.
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
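The precision and recall formulas of Q80 translate directly into set operations; the document ids below are invented:

```python
def precision_recall(relevant, retrieved):
    """Basic text-retrieval measures from Q80."""
    hit = len(relevant & retrieved)   # |{Relevant} ∩ {Retrieved}|
    precision = hit / len(retrieved)
    recall = hit / len(relevant)
    return precision, recall

relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d9", "d10", "d11"}

p, r = precision_recall(relevant, retrieved)
# p = 2/5 = 0.4, r = 2/4 = 0.5
```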
81. What is an object cube?
In an object database, data generalization and multidimensional analysis are not applied to individual objects but to classes of objects. The attribute-oriented induction method developed for mining characteristics of relational databases can be extended to mine data characteristics in object databases. The generalization of multidimensional attributes of a complex object class can be performed by examining each attribute (or dimension), generalizing each attribute to simple-valued data, and constructing a multidimensional data cube, called an object cube. Once an object cube is constructed, multidimensional analysis and data mining can be performed on it in a manner similar to that for relational data cubes.

82. What are the challenges faced in web data mining?
1) The web seems to be too large for effective data warehousing and data mining.
2) The complexity of the web is far greater than that of any text document collection. Web pages lack a unifying structure; they contain far more variations in authoring style and content.
3) The web is a highly dynamic information source. News, stock markets, weather etc. are updated regularly on the web.
4) The web serves a broad diversity of user communities. The internet currently connects more than 100 million workstations, and users can easily get lost by groping in the “darkness” of the network.
5) Only a small portion of the information on the web is truly relevant or useful.

83. What are the data mining applications?
1) Intrusion detection
2) Association & correlation analysis
3) Analysis of stream data
4) Distributed data mining
5) Visualizing & querying tools

84. What are the recent trends in data mining?
1) Application exploration: in financial analysis, telecommunications, biomedicine, countering terrorism etc.
2) Scalable and interactive data mining methods: constraint-based mining
3) Integration of DM systems with DB, DW and web DB systems
4) Standardization of data mining languages
5) Visual data mining
6) New methods for mining complex types of data
7) Biological data mining: mining DNA and protein sequences etc.
8) DM applications in software engineering
9) Web mining
10) Distributed data mining
11) Real-time or time-critical DM
12) Graph mining
13) Privacy protection and information security in DM
14) Multi-relational & multi-database DM

85. What is web usage mining?
Besides mining web contents and web structures, another important task for web mining is web usage mining, which mines weblog records to discover user access patterns of web pages. This helps to identify high-potential customers for electronic commerce, improve web server performance, etc. A web server usually registers a weblog entry for every access of a web page. An entry includes the URL requested, the IP address from which the request originated, and a time stamp.
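A sketch of the starting point of web usage mining as described in Q85: parsing simplified weblog entries (IP address, time stamp, requested URL; the log lines are invented) and counting page accesses:

```python
from collections import Counter

# Simplified weblog lines: "ip timestamp url"
weblog = [
    "10.0.0.1 2024-01-05T10:00 /index.html",
    "10.0.0.2 2024-01-05T10:01 /products.html",
    "10.0.0.1 2024-01-05T10:02 /products.html",
    "10.0.0.3 2024-01-05T10:05 /products.html",
]

def page_access_counts(lines):
    """Count how often each URL was requested."""
    urls = (line.split()[2] for line in lines)
    return Counter(urls)

counts = page_access_counts(weblog)
top_page, top_hits = counts.most_common(1)[0]
```

Real weblogs use a richer format (e.g. the Common Log Format), but the mining step is the same: group and count accesses to expose usage patterns.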
86. What are similarity-based retrieval methods in image databases?
a. Description-based retrieval systems: build indices and perform object retrieval based on image descriptions, such as keywords, captions, size and time of creation.
b. Content-based retrieval systems: support retrieval based on the image content, such as color histogram, texture, pattern, image topology, and the shape of objects and their layouts in the image.

87. What are the approaches used for similarity-based retrieval in image databases?
1) Color-histogram-based signature: based on the color composition of the image. It does not contain any information about shape, image topology or texture.
2) Multifeature composed signature: the signature of the image includes multiple features: color histogram, shape, image topology and texture. The extracted features are stored as metadata, and images are indexed based on the metadata.
3) Wavelet-based signature: this approach uses the dominant wavelet coefficients of an image as its signature. Wavelets capture shape, texture and image topology information in a single unified framework.
4) Wavelet-based signature with region-based granularity: the computation and comparison of signatures are performed at the granularity of regions rather than the entire image.
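A color-histogram signature and its comparison (approach 1 above) can be sketched as follows; the tiny "images" are invented lists of color names, whereas real systems bin actual RGB pixel values:

```python
from collections import Counter

def histogram_signature(pixels):
    """Normalized color histogram: fraction of pixels per color."""
    counts = Counter(pixels)
    total = len(pixels)
    return {color: n / total for color, n in counts.items()}

def l1_distance(sig_a, sig_b):
    """Smaller distance = more similar color composition."""
    colors = set(sig_a) | set(sig_b)
    return sum(abs(sig_a.get(c, 0) - sig_b.get(c, 0)) for c in colors)

img1 = ["blue"] * 3 + ["green"] * 1
img2 = ["blue"] * 2 + ["green"] * 2
img3 = ["red"] * 4

d_close = l1_distance(histogram_signature(img1), histogram_signature(img2))
d_far = l1_distance(histogram_signature(img1), histogram_signature(img3))
```

As the answer above notes, such a signature says nothing about shape, topology or texture: two very different pictures with the same color mix would get identical signatures.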