
Answer # 1(a)

1. Movie Ratings: Ordinal


Movie ratings are an ordinal attribute because there is a defined order between the ratings, i.e.
5>4>3>2>1. However, the distance between the values is not defined: a movie with a rating of
2.5 stars is not said to be half as good as a movie with a rating of 5 stars. Adding or
subtracting movie ratings is also not meaningful, so the attribute cannot be an Interval or Ratio
attribute.

2. Percentage of correct answers in a quiz test: Ratio


Percentages are real numbers on which all mathematical calculations are allowed, so this is a Ratio
attribute. Additionally, there is a defined zero point, 0%, which means that there are no
correct answers.

3. Seat numbers on a flight: Nominal


This is a Nominal attribute because seat numbers serve only as labels that enumerate the seats for
the passengers. No mathematical operations can meaningfully be performed on seat numbers.

Answer # 1(b)
Calculation                   Nominal  Ordinal  Interval  Ratio
frequency distribution        yes      yes      yes       yes
median and percentiles        no       yes      yes       yes
addition and subtraction      no       no       yes       yes
mean and standard deviation   no       no       yes       yes
coefficient of variation      no       no       no        yes
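The table can also be read as a lookup from measurement scale to permitted calculations. The sketch below is only an illustration of that reading: the `ALLOWED` dictionary, the `coefficient_of_variation` helper, and the quiz scores are hypothetical and do not come from the assignment.

```python
import statistics

# Permitted calculations per measurement scale, mirroring the table above.
ALLOWED = {
    "nominal": {"frequency distribution"},
    "ordinal": {"frequency distribution", "median and percentiles"},
    "interval": {"frequency distribution", "median and percentiles",
                 "addition and subtraction", "mean and standard deviation"},
    "ratio": {"frequency distribution", "median and percentiles",
              "addition and subtraction", "mean and standard deviation",
              "coefficient of variation"},
}

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (meaningful for ratio data only)."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical percentages of correct answers (a ratio attribute).
quiz_scores = [50.0, 60.0, 70.0]
cv = coefficient_of_variation(quiz_scores)
```

The coefficient of variation divides by the mean, which is why it needs a true zero point and appears only in the Ratio column.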

Answer # 1(c)

Missing data in the training data set can reduce the fit of a model and can lead to a biased model,
because the behavior of an attribute and its relationship with other variables cannot be analyzed
correctly. This can lead to wrong predictions or classifications. Two techniques that can be deployed
for handling missing values are the Delete Strategy and the Mode Strategy.

1. Delete Strategy: Ignore all instances with missing values


Pros: Simplicity is the major advantage of this strategy, as the whole instance is simply ignored
when one of its attribute values is missing. The learning algorithm itself does not need to
change; the only addition is a filter that blocks instances with unknown/missing values.

Cons: This strategy reduces the size of the training data and hence reduces the power/fit of the
model due to the smaller sample size. Moreover, deleting multiple instances can bias the data, as
some attributes may not be relevant to a particular class. For instance, if the outcome of a model
depends on the missing values, the results generated will be biased and incorrect. The strategy
also discards the known attribute values of the deleted instances along with the missing ones, so
not all of the training data is used for modeling.

Compared with the Mode strategy, the Delete strategy is the simpler approach to handling missing
data, at the cost of a smaller data set. Before using it, the distribution of the missing data
should be analyzed: if the outcome of an experiment depends on the missing values, the results
generated will be biased and incorrect, and the known attribute values of the deleted instances
are lost as well.
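A minimal sketch of the Delete strategy as a filter placed in front of the learning algorithm. The `drop_incomplete` function, the sample rows, and the use of `?` as the missing-value marker (the convention in ARFF files) are assumptions made for illustration.

```python
def drop_incomplete(rows, missing="?"):
    """Keep only the rows in which no attribute carries the missing marker."""
    return [row for row in rows if missing not in row]

# Hypothetical instances: three attributes followed by a class label.
rows = [
    ["2", "4", "4", "standing"],
    ["?", "6", "4", "standing"],   # dropped: first attribute is missing
    ["4", "3", "?", "lying"],      # dropped: third attribute is missing
    ["7", "8", "3", "standing"],
]
complete = drop_incomplete(rows)
```

Note that the two dropped rows also take their known values and class labels with them, which is the loss of information described in the cons.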

2. Mode Strategy: Replace any missing value with the most common value

Pros: With the Mode strategy, all the available data about all instances is used for training and
experimenting, so the outcomes are based on an analysis of the full data set. The results will
not be biased if the outcome does not depend on the missing attribute value. The strategy is most
advantageous for discrete attributes, where replacing missing values with the most common value
makes use of information already present in the data.

Cons: Replacing the missing value of an attribute with the most common value can bias the data
and produce misleading results, especially in cases where the results depend critically on the
missing value, e.g. medical diagnosis. Additionally, if the number of unknown attribute values is
high, replacing them with the mode of that attribute reduces the variability of the data. This
strategy cannot be used where the missing value must be a unique identifier, such as a person's
SSN.
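A minimal sketch of the Mode strategy for a single discrete attribute. The `impute_mode` function, the sample rows, and the `?` marker for missing values are hypothetical; the sketch also assumes the column has at least one known value.

```python
from collections import Counter

def impute_mode(rows, column, missing="?"):
    """Return copies of the rows with missing values in one column replaced by its mode."""
    observed = [row[column] for row in rows if row[column] != missing]
    mode = Counter(observed).most_common(1)[0][0]  # most frequent known value
    return [
        row[:column] + [mode] + row[column + 1:] if row[column] == missing else list(row)
        for row in rows
    ]

# Hypothetical instances: a discrete attribute followed by a class label.
rows = [
    ["red", "yes"],
    ["blue", "no"],
    ["red", "no"],
    ["?", "yes"],
]
filled = impute_mode(rows, column=0)
```

Every missing entry receives the same value, which is how a large share of missing values reduces the variability of the attribute.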

Answer # 2(a)

@relation shapes

@attribute width numeric
@attribute height numeric
@attribute sides numeric
@attribute class {standing,lying}

@data
2,4,4,standing
3,6,4,standing
4,3,4,lying
7,8,3,standing
7,6,3,lying
2,9,4,standing
9,1,4,lying
10,2,3,lying

Answer # 2(b)

Test option                  OneR     JRip     J48      Naive Bayes
Training Set                 50.00%   87.50%   87.50%   100.00%
Percentage Split             33.33%   66.67%   66.67%   66.67%
Cross-validation (6 folds)   25.00%   37.50%   62.50%   62.50%

The OneR algorithm showed the lowest accuracy for all three testing options: Training Set,
Percentage Split (PS), and Cross-validation (CV). This implies that a single attribute is not enough
to classify the given data set accurately. Naive Bayes, on the other hand, considers all the
attributes of an instance when classifying; it achieved the highest accuracy on the Training Set and
matched the best of JRip and J48 on PS and CV. The 100% accuracy of Naive Bayes on the Training Set
reflects that the data used for training is the same as the data used for testing. For CV and PS,
however, none of the algorithms achieved 100% accuracy, since they do not compare attribute values
with each other, and the solution to this problem depends on comparing the Height and Width attributes.
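Because the data set has only 8 instances, each percentage in the table corresponds to a small count of correctly classified instances. The sketch below converts the reported accuracies back to counts. The `correct_count` helper is hypothetical, and the test-set size of 3 for the percentage split is an assumption inferred from the 33.33% and 66.67% values, not something stated in the answer.

```python
def correct_count(accuracy_percent, n_tested):
    """Nearest whole number of correctly classified instances for a reported accuracy."""
    return round(accuracy_percent / 100 * n_tested)

N_ALL = 8    # instances in the @data section
N_SPLIT = 3  # assumed size of the held-out part in the percentage split

# Reported accuracies per classifier: (training set, percentage split, cross-validation)
reported = {
    "OneR": (50.00, 33.33, 25.00),
    "JRip": (87.50, 66.67, 37.50),
    "J48": (87.50, 66.67, 62.50),
    "Naive Bayes": (100.00, 66.67, 62.50),
}

counts = {
    name: (correct_count(train, N_ALL), correct_count(split, N_SPLIT), correct_count(cv, N_ALL))
    for name, (train, split, cv) in reported.items()
}
```

Under these assumptions a single instance changes the accuracy by 12.5 points on the full set and by about 33 points on the split, so small differences between classifiers should be read with caution.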

Answer # 2(c)
Using a clustering strategy, specifically SimpleKMeans, we visualized the graph of Width vs.
Height in Fig. 1. From the graph, we observed that the instances cannot be classified on the basis
of the value of height or width alone. It is the ratio of width to height that classifies the
instances as standing or lying. In Fig. 2, the graph of Class vs. Height, the instances are not
correctly separated into two classes: no threshold value for Height divides the instances into the
two classes. Similarly, in Fig. 2, the graph of Width vs. Class shows that no threshold value for
Width divides the instances into the two classes. Thus, the only way to classify the instances is
by combining their height and width in the form of a ratio, as shown in Fig. 1.
Additionally, in the Width vs. Height graph of Fig. 1, the two classes can be separated by a
straight line, and hence a linear classification can be obtained that assigns all instances to the
standing or lying class.
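The argument above can be checked directly on the eight instances from Answer 2(a). The sketch below tests whether a single threshold on Width or on Height separates the classes, and then applies a comparison rule. The rule `height > width` is an illustrative assumption consistent with the ratio argument, not a model produced by Weka, and the function names are hypothetical.

```python
# (width, height, class) taken from the @data section of the shapes relation.
SHAPES = [
    (2, 4, "standing"), (3, 6, "standing"), (4, 3, "lying"), (7, 8, "standing"),
    (7, 6, "lying"), (2, 9, "standing"), (9, 1, "lying"), (10, 2, "lying"),
]

def separable_by_threshold(values, labels):
    """True if one cut point puts all 'standing' values on one side and all 'lying' on the other."""
    standing = [v for v, c in zip(values, labels) if c == "standing"]
    lying = [v for v, c in zip(values, labels) if c == "lying"]
    return max(standing) < min(lying) or max(lying) < min(standing)

def classify_by_comparison(width, height):
    """Assumed rule: a shape that is taller than it is wide is 'standing'."""
    return "standing" if height > width else "lying"

widths = [w for w, _, _ in SHAPES]
heights = [h for _, h, _ in SHAPES]
labels = [c for _, _, c in SHAPES]

width_separable = separable_by_threshold(widths, labels)
height_separable = separable_by_threshold(heights, labels)
rule_correct = sum(classify_by_comparison(w, h) == c for w, h, c in SHAPES)
```

On this data neither attribute is separable by a threshold on its own, while the comparison rule labels all 8 instances correctly. The rule corresponds to the straight line height = width in the Width vs. Height plot.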
