data names and definitions of the given warehouse.

14. Define VLDB.
Very Large Data Base. A database whose size is greater than 100 GB is said to be a very large database.

15. What is data cleaning?
Data cleaning routines remove incomplete, noisy and inconsistent data by
• filling in missing values
• smoothing out noise
• identifying outliers and
• correcting inconsistencies in the data

16. Mention the categories of data that may be encountered in mining.
The data used in analysis by data mining techniques may fall under the following categories:
• Incomplete data – data lacking attribute values or certain attributes of interest.
• Noisy data – data containing errors or outlier values that deviate from the expected. Noise is a random error or variance in a measured variable.
• Inconsistent data – there may be inconsistencies in data recorded in some transactions, inconsistencies due to data integration (where a given attribute may have different names in different databases), and inconsistency due to data redundancy.

17. What are the various data smoothing techniques to remove noise?
The various data smoothing techniques are
• Binning
• Clustering
• Combined computer and human inspection
• Regression

18. What is Binning?
Binning is used to smooth data values by consulting their neighborhood values. The sorted values are distributed into a number of “buckets” or “bins”: the data are first sorted and then partitioned into equidepth bins. There are three types of binning:
• Smoothing by bin means – each value is replaced by the mean value of its bin.
• Smoothing by bin medians – each value is replaced by its bin median.
• Smoothing by bin boundaries – the maximum and minimum values in the bin are identified as the bin boundaries, and each value in the bin is replaced by the closest boundary value.

19. What is data integration? What are the issues to be considered while integrating data?
Data integration combines data from multiple sources into a coherent data store. Issues to be considered are
a) Entity identification problem
b) Correlation analysis
c) Detection and resolution of data value conflicts

20. What is data transformation? What are the various methods of transforming data?
Data transformation transforms and consolidates data
into forms appropriate for mining. The following are the various methods of transforming data:
i. Smoothing
ii. Aggregation
iii. Generalization
iv. Normalization
v. Attribute construction

UNIT III

1. Define the concept of classification.
Classification is a two-step process:
1. A model is built describing a predefined set of data classes or concepts. The model is constructed by analyzing database tuples described by attributes.
2. The model is used for classification.

2. What is a Decision tree?
A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions. The topmost node in a tree is the root node.

3. What is tree pruning?
Tree pruning attempts to identify and remove branches that reflect noise or outliers in the training data, with the goal of improving classification accuracy on unseen data.

4. What is an Attribute Selection Measure?
The information gain measure is used to select the test attribute at each node in the decision tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split.

5. Describe Tree pruning methods.
When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data.
Approaches:
• Pre pruning
• Post pruning

6. Define Pre Pruning.
A tree is pruned by halting its construction early. Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset of samples.

7. Define Post Pruning.
Post pruning removes branches from a “fully grown” tree. A tree node is pruned by removing its branches.
E.g.: the Cost Complexity algorithm.

8. Define information gain.
The information gain measure is used to select the test attribute at each node in the tree. Such a measure is referred to as an attribute selection measure or a measure of the goodness of split. The attribute with the highest information gain is chosen as the test attribute for the current node.

9. How does tree pruning work?
There are two approaches to tree pruning:
a. In the prepruning approach, a tree is pruned by halting its construction early, e.g. by deciding not to further split the training samples at a given node. Upon halting, the node becomes a leaf node.
b. In the postpruning approach, branches are removed from a fully grown tree. The lowest pruned node becomes a leaf
and is labeled by the most frequent class.

10. How are classification rules extracted from a decision tree?
The knowledge represented in a decision tree can be extracted and represented in the form of classification IF-THEN rules. One rule is created for each path from the root to a leaf node.
E.g. IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”

11. What are Bayesian classifiers?
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class.

14. What is ID3?
ID3 is an algorithm used to build decision trees. The following steps are followed to build a decision tree:
a. Choose the splitting attribute with the highest information gain.
b. A split should reduce the amount of information needed by a large amount.

15. What is the difference between “supervised” and “unsupervised” learning schemes?
In data mining, during classification the class label of each training sample is provided; this type of training is called supervised learning, i.e. the learning of the model is supervised in that it is told to which class each training sample belongs. E.g.: classification.
In unsupervised learning the class label of each training sample is not known, and the number or set of classes to be learned may not be known in advance. E.g.: clustering.
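The information gain measure described in the answers above can be illustrated with a short sketch (a minimal Python illustration with invented function names, not the full ID3 algorithm):

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information (in bits) needed to classify a sample."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(samples, labels, attribute):
    """Reduction in entropy obtained by splitting on one attribute."""
    n = len(samples)
    split = {}
    for sample, label in zip(samples, labels):
        split.setdefault(sample[attribute], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - remainder

# Toy data: 'student' perfectly predicts the class, 'age' does not.
samples = [{"age": "<=30", "student": "yes"}, {"age": "<=30", "student": "no"},
           {"age": ">40", "student": "yes"}, {"age": ">40", "student": "no"}]
labels = ["yes", "no", "yes", "no"]
print(information_gain(samples, labels, "student"))  # 1.0 (maximal)
print(information_gain(samples, labels, "age"))      # 0.0
```

Splitting on "student" separates the classes perfectly, so its gain equals the full entropy of the labels, while "age" gives no gain; ID3 would therefore pick "student" as the test attribute.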
different objects into meaningful and descriptive objects.

4. What are the fields in which clustering techniques are used?
Clustering is used in biology to develop new plant and animal taxonomies.
Clustering is used in business to enable marketers to develop new distinct groups of their customers and characterize each customer group on the basis of purchasing patterns.
Clustering is used in the identification of groups of automobile insurance policy customers.
Clustering is used in the identification of groups of houses in a city on the basis of house type, cost and geographical location.
Clustering is used to classify documents on the web for information discovery.

5. What are the requirements of cluster analysis?
The basic requirements of cluster analysis are
• Dealing with different types of attributes
• Dealing with noisy data
• Constraints on clustering
• Dealing with arbitrary shapes
• High dimensionality
• Ordering of input data
• Interpretability and usability
• Determining input parameters, and
• Scalability

6. What are the different types of data used for cluster analysis?
The different types of data used for cluster analysis are interval scaled, binary, nominal, ordinal and ratio scaled data.

7. What are interval scaled variables?
Interval scaled variables are continuous measurements on a linear scale, for example height and weight, weather temperature, or coordinates of any cluster. Distances between these measurements can be calculated using the Euclidean distance or the Minkowski distance.

8. Define binary variables. What are the two types of binary variables?
Binary variables have only two states, 0 and 1: when the state is 0 the variable is absent, and when the state is 1 the variable is present. There are two types of binary variables, symmetric and asymmetric. Symmetric binary variables are those whose states have the same values and weights. Asymmetric binary variables are those whose states do not have the same values and weights.

9. Define nominal, ordinal and ratio scaled variables.
A nominal variable is a generalization of the binary variable: it has more than two states. For example, a nominal variable color may consist of four states: red, green, yellow, or black. In a nominal variable the total number of states is N, and the states are denoted by letters, symbols or integers.
An ordinal variable also has more than two states, but all these states are ordered in a meaningful sequence.
A ratio scaled variable makes positive measurements on a non-linear scale, such as an exponential scale, using the formula Ae^(Bt) or Ae^(-Bt), where A and B are constants.

10. What do you mean by the partitioning method?
In the partitioning method, a partitioning algorithm arranges all the objects into
various partitions, where the total number of partitions is less than the total number of objects. Here each partition represents a cluster. The two types of partitioning method are k-means and k-medoids.

11. Define CLARA and CLARANS.
Clustering in LARge Applications is called CLARA. The efficiency of CLARA depends upon the size of the representative data set. CLARA does not work properly if any of the selected representative data sets fails to find the best k-medoids.
To overcome this drawback a new algorithm, Clustering Large Applications based upon RANdomized search (CLARANS), was introduced. CLARANS works like CLARA; the only difference between CLARA and CLARANS is the clustering process that is done after selecting the representative data sets.

12. What is the Hierarchical method?
The hierarchical method groups all the objects into a tree of clusters that are arranged in a hierarchical order. This method works on bottom-up or top-down approaches.

13. What is the divisive method?
In the divisive method, all the objects are arranged within one big singular cluster, and this large cluster is continuously divided into smaller clusters until each cluster holds a single object.

14. What is CURE?
Clustering Using REpresentatives is called CURE. Clustering algorithms generally work well only on spherical clusters of similar size. CURE overcomes this limitation and is more robust with respect to outliers.

15. Define the Chameleon method.
Chameleon is another hierarchical clustering method that uses dynamic modeling. Chameleon was introduced to overcome the drawbacks of the CURE method. In this method two clusters are merged if the interconnectivity between the two clusters is greater than the interconnectivity between the objects within a cluster.

16. Define Association rule mining.
Association rule mining searches for interesting relationships among items in a given data set. Rule support and confidence are the two measures of rule interestingness.
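The partitioning idea described above (each partition is a cluster; objects are assigned to the nearest cluster and the cluster means are recomputed) can be sketched in one dimension. This is a toy illustration of k-means, not a production implementation:

```python
def kmeans_1d(points, centers, iterations=10):
    """Repeatedly assign each point to its nearest center (partition step),
    then move each center to the mean of its cluster (update step)."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:                                  # assignment step
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[j]   # update step
                   for j, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centers=[1, 12])
print(centers)    # [2.0, 11.0]
print(clusters)   # [[1, 2, 3], [10, 11, 12]]
```

k-medoids differs only in the update step: instead of the mean, the most centrally located object of each cluster is chosen as its representative.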
E.g. age(X, “20…29”) ∧ buys(X, “laptop”) ⇒ buys(X, “b/w printer”)

36. What are the factors affecting the performance of the Apriori candidate generation technique?
• Need to generate a huge number of candidate sets
• Need to repeatedly scan the database and check a large set of candidates by pattern matching

37. Describe the method of generating frequent item sets without candidate generation.
Frequent-pattern growth (or FP-growth) adopts a divide-and-conquer strategy.
Steps:
-> Compress the database representing frequent items into a frequent-pattern tree, or FP-tree
-> Divide the compressed database into a set of conditional databases
-> Mine each conditional database separately

38. Define Iceberg query.
An iceberg query computes an aggregate function over an attribute or set of attributes in order to find aggregate values above some specified threshold.
Given a relation R with attributes a1, a2, ....., an and b, and an aggregate function agg_f, an iceberg query is of the form
Select R.a1, R.a2, ....., R.an, agg_f(R.b)
From relation R
Group by R.a1, R.a2, ...., R.an
Having agg_f(R.b) >= threshold

40. Mention a few approaches to mining Multilevel Association Rules.
• Uniform minimum support for all levels (or uniform support)
• Using reduced minimum support at lower levels (or reduced support)
• Level-by-level independent
• Level-cross filtering by single item
• Level-cross filtering by k-itemset

41. What are multidimensional association rules?
Association rules that involve two or more dimensions or predicates.
• Interdimension association rule: a multidimensional association rule with no repeated predicate or dimension
• Hybrid-dimension association rule: a multidimensional association rule with multiple occurrences of some predicates or dimensions

42. Define constraint-based Association Mining.
Mining is performed under the guidance of various kinds of constraints provided by the user. The constraints include the following:
• Knowledge type constraints
• Data constraints
• Dimension/level constraints
• Interestingness constraints
• Rule constraints
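Rule support and confidence, the two interestingness measures mentioned above, reduce to simple counting. The transactions below are invented for the example:

```python
# Toy transaction database: each transaction is a set of purchased items.
transactions = [
    {"laptop", "printer"},
    {"laptop", "printer", "mouse"},
    {"laptop"},
    {"mouse"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Conditional frequency: support(lhs AND rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"laptop", "printer"}))        # 0.5  (2 of 4 transactions)
print(confidence({"laptop"}, {"printer"}))   # 2/3 of laptop buyers also buy a printer
```

Apriori and FP-growth both search for itemsets whose support exceeds a minimum threshold; they differ in whether explicit candidate sets are generated.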
Visualization tools and genetic data analysis.

12. What are the factors involved in choosing a data mining system?
• Data types
• System issues
• Data sources
• Data mining functions and methodologies
• Coupling data mining with database and/or data warehouse systems
• Scalability
• Visualization tools
• Data mining query language and graphical user interface

13. Define DMQL.
Data Mining Query Language. It specifies clauses and syntax for performing different types of data mining tasks, for example data classification, data clustering and mining association rules. It uses SQL-like syntax to mine databases.

14. Define text mining.
Extraction of meaningful information from large amounts of free-format textual data.
Useful in artificial intelligence and pattern matching.
Also known as text data mining, knowledge discovery from text, or content analysis.

15. What does web mining mean?
A technique to process information available on the web and search for useful data.
Used to discover web pages, text documents, multimedia files, images, and other types of resources from the web.
Used in several fields such as e-commerce, information filtering, fraud detection, and education and research.

16. Define spatial data mining.
Extracting undiscovered and implied spatial information.
Spatial data: data that is associated with a location.
Used in several fields such as geography, geology, medical imaging etc.

17. Explain multimedia data mining.
Mines large databases.
Does not retrieve any specific information from multimedia databases.
Derives new relationships, trends, and patterns from stored multimedia data.
Used in medical diagnosis, stock markets, the animation industry, the airline industry, traffic management systems, surveillance systems etc.

18. What is Time Series Analysis?
A time series is a set of attribute values over a period of time. Time series analysis may be viewed as finding patterns in the data and predicting future values.

19. What are the various detected patterns?
Detected patterns may include:
• Trends: may be viewed as systematic non-repetitive changes to the values over time.
• Cycles: the observed behavior is cyclic.
• Seasonal: the detected patterns may be based on time of year, month or day.
• Outliers: to assist in pattern detection, techniques may be needed to remove or reduce the impact of outliers.

20. What is a spatial database?
A spatial database stores a large amount of space-related data, such as maps, preprocessed remote sensing or medical imaging data.
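A simple moving average is a common first step when looking for trends in a time series like those described above. This small sketch assumes the series is a plain Python list of values:

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values; smooths the
    series so that a slow trend stands out from short-term noise."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

print(moving_average([1, 2, 3, 4, 5, 6], 3))   # [2.0, 3.0, 4.0, 5.0]
```

The steadily increasing averages in the output are what a trend looks like after smoothing; cyclic or seasonal patterns would instead show repeated rises and falls at a fixed period.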
31. What are the two popular data independent transformations?
1) Discrete Fourier Transform (DFT)
2) Discrete Wavelet Transform (DWT)

32. What are the similarity searches that handle gaps and differences in offsets and amplitudes?
The searches that handle gaps and amplitude differences are
• Atomic matching
• Window stitching
• Subsequence ordering

33. What are the parameters that affect the result of sequential pattern mining?
• Duration
• Event folding window
• Interval

34. What is a serial episode and a parallel episode?
A serial episode is a set of events that occurs in a total order, whereas a parallel episode is a set of events whose occurrence ordering is trivial.

35. What is periodicity analysis? What are the problems in periodicity analysis?
Periodicity analysis is the mining of periodic patterns, i.e. the search for recurring patterns in a time-series database. The following are the problems in periodicity analysis:
• Mining full periodic patterns
• Mining partial periodic patterns
• Mining cyclic or periodic association rules

36. What is a text database?
A text database consists of a large collection of documents from various sources such as news articles, research papers, books, digital libraries, e-mail messages, and web pages. Data is stored in semi-structured form.

37. What is Information Retrieval (IR)?
Information Retrieval is concerned with the organization and retrieval of information from a large number of text-based documents. A typical information retrieval problem is to locate relevant documents based on user input, such as keywords or example documents.

38. What are the basic measures for assessing the quality of text retrieval?
1) Precision – the percentage of retrieved documents that are in fact relevant to the query.
2) Recall – the percentage of documents that are relevant to the query and were, in fact, retrieved.

39. Write short notes on the multidimensional data model.
Data warehouses and OLAP tools are based on a multidimensional data model. This model is used for the design of corporate data warehouses and departmental data marts. It includes the Star schema, Snowflake schema and Fact constellation schemas. The core of the multidimensional model is the data cube.

40. Define data cube.
A data cube consists of a large set of facts (or measures) and a number of dimensions.

41. What are facts?
Facts are numerical measures. Facts can also be considered as quantities by which we can analyze the relationships between dimensions.

42. What are dimensions?
Dimensions are the entities or perspectives with respect to which an organization keeps records; they are hierarchical in nature.

43. Define dimension table.
A dimension table is used for describing a dimension.
E.g. a dimension table for item may contain the attributes item_name, brand and type.

44. Define fact table.
A fact table contains the names of the facts (or measures) as well as keys to each of the related dimension tables.

45. What is a lattice of cuboids?
In the data warehousing research literature, a cube is also called a cuboid. For different sets of dimensions, we can construct a lattice of cuboids, each showing the data at a different level of summarization. The lattice of cuboids is also referred to as a data cube.

46. What is the apex cuboid?
The 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The apex cuboid is typically denoted by all.

47. List out the components of the star schema.
• A large central table (fact table) containing the bulk of the data with no redundancy.
• A set of smaller attendant tables (dimension tables), one for each dimension.

Star schema.
A multidimensional data model can exist in the form of a star schema. It consists of
a) a large central table (fact table) containing data with no redundancy
b) a set of smaller attendant tables (dimension tables), one for each dimension

48. What is the snowflake schema?
The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the tables into additional tables.

49. List out the components of the fact constellation schema.
This schema requires multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence it is known as a galaxy schema (or) fact constellation schema.

50. Point out the major difference between the star schema and the snowflake schema.
The dimension tables of the snowflake schema model may be kept in normalized form to reduce redundancies. Such tables are easy to maintain and save storage space.

51. Which is popular in data warehouse design, the star schema model (or) the snowflake schema model?
The star schema model, because the snowflake structure can reduce the effectiveness of browsing and more joins will be needed to execute a query.

52. Define concept hierarchy.
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level concepts.

53. Define total order.
If the attributes of a dimension which forms a concept hierarchy such as
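How the measures in a fact table are summarized along chosen dimensions (the cuboids described above) can be sketched as a toy "group by" aggregation; the fact rows and field names are invented for the example:

```python
from collections import defaultdict

# Toy fact-table rows: two dimensions (item, city) and one measure (sales).
facts = [
    {"item": "laptop",  "city": "Chennai", "sales": 5},
    {"item": "laptop",  "city": "Delhi",   "sales": 3},
    {"item": "printer", "city": "Chennai", "sales": 2},
]

def roll_up(facts, dims, measure="sales"):
    """Sum the measure over the chosen dimensions; each dims choice
    corresponds to one cuboid in the lattice."""
    cube = defaultdict(int)
    for row in facts:
        cube[tuple(row[d] for d in dims)] += row[measure]
    return dict(cube)

print(roll_up(facts, ["item"]))   # {('laptop',): 8, ('printer',): 2}
print(roll_up(facts, []))         # apex cuboid (total summarization): {(): 10}
```

Grouping by both dimensions gives the base cuboid, and grouping by none gives the apex cuboid, matching the lattice described in questions 45 and 46.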
There are a tremendous number of online documents available. Automated document classification is an important text mining task, as the need exists to automatically organize documents into classes to facilitate document retrieval and subsequent analysis. A general procedure for automated document classification:
First, a set of pre-classified documents is taken as a training set. The training set is then analyzed in order to derive a classification scheme. Such a classification scheme often needs to be refined with a testing process. The so-derived classification scheme can be used for classification of other on-line documents.
A few typical classification methods used in text classification are
a. Nearest-neighbour classification
b. Feature selection methods
c. Bayesian classification

76. Explain briefly some data classification methods.

a. Nearest-neighbor classification: uses k-nearest-neighbor classification, which is based on the intuition that similar documents are expected to be assigned the same class label.
i) We can simply index the training documents and associate each with a class label.
ii) The class label of a text document can be determined based on the class label distribution of its k nearest neighbors.
By tuning k and incorporating refinements, this kind of classification can achieve accuracy comparable to the best classification results.
b. Feature selection methods: terms that are statistically uncorrelated with the class labels can be removed before classification.
c. Bayesian classification: first trains the model by calculating a generative document distribution P(d|c) for each class c, and then tests which class is most likely to generate the test document.

77. What are the different methods of document clustering?

Document clustering is one of the most crucial techniques for organizing documents in an unsupervised manner (the class label is not known beforehand).
a. Spectral clustering method: first performs spectral embedding (dimensionality reduction) on the original data, and then applies a traditional clustering algorithm (e.g. k-means) on the reduced document space.
b. Mixture model clustering method: models the text data with a mixture model (involving multinomial component models). Clustering involves two steps:
(1) estimating the model parameters based on the text data and any additional prior knowledge, and
(2) inferring the clusters based on the estimated model parameters.
c. Latent semantic indexing (LSI) and locality preserving indexing (LPI): these are linear dimensionality reduction methods. We can acquire transformation vectors or an embedding function through which we embed all of the data into a lower-dimensional space.

78. What is a time series database?

A time series database consists of sequences of values or events obtained over repeated measurements of time (hourly, daily, weekly). Time-series databases are popular in many applications, such as stock market analysis, economic and sales forecasting, budgetary analysis, workload projections, process and quality control, natural phenomena (such as atmospheric temperature, wind, earthquakes), scientific and engineering experiments, and medical treatments.
The amount of time-series data is increasing rapidly (gigabytes per day), such as in stock
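The nearest-neighbor document classification idea in item (a) above can be sketched with a crude word-overlap similarity (a hypothetical stand-in for the cosine similarity usually used; the training documents are invented for the example):

```python
from collections import Counter

# Toy training set: (document text, class label) pairs.
train = [
    ("the stock market rose today", "finance"),
    ("shares and stock prices fell", "finance"),
    ("the team won the football match", "sports"),
    ("a great match for the home team", "sports"),
]

def similarity(d1, d2):
    """Number of shared words: a crude document-similarity measure."""
    return len(set(d1.split()) & set(d2.split()))

def knn_classify(doc, k=3):
    """Label a document with the majority class of its k nearest neighbors."""
    nearest = sorted(train, key=lambda t: similarity(doc, t[0]), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_classify("stock prices rose"))       # finance
print(knn_classify("the football team won"))   # sports
```

As the answer above notes, tuning k and refining the similarity measure (e.g. weighting terms and using cosine similarity) is what brings this simple scheme up to competitive accuracy.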