
Advanced Data Analysis Part II: Machine Learning, Assignment 1

These exercises are intended to be a first contact with clustering and classification algorithms, as well as with the Matlab Statistics Toolbox. In order to make the visualization and the interpretation of the results easier, we use low-dimensional datasets, but one must keep in mind that real-life data may be described by hundreds, thousands, or even hundreds of thousands of features.

Exercise 1: k-means clustering


The goal of this exercise is to perform the clustering of a dataset containing 60 examples from two classes: pregnancy and labor. The instances that you have to cluster can be found in the file pregVSlabor_instances.mat. The file pregVSlabor_labels.mat, which contains the true classes of the instances, can be used to visualize the data and to perform an indirect evaluation of the clustering. A minimal Matlab sketch of the steps is given after the list.

1. Load the instances and the corresponding labels into your Matlab workspace. Each instance is described by four features. By printing the data matrix, what can you observe about the features? Construct a new dataset containing only the two features that seem the most important, and create a 2D scatter plot of the data (using the Matlab function scatter; you can use the labels as colors in your scatter plot to visualize which instance belongs to which class).

2. Create k-means clusterings of the data for different values of the number of clusters k (using the Matlab function kmeans).

3. Visualize some of the clusterings obtained (using the scatter function, with the cluster index of the instances as colors). How many clusters would you choose? Why?

4. Choose the optimal number of clusters according to the BIC criterion. How many clusters do you find?

5. Perform an indirect evaluation of the clustering(s) obtained (both according to your manual choice and to the BIC criterion, if they differ) using the class labels.
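A minimal sketch of steps 1 to 4 follows. The variable names instances and labels inside the .mat files are guesses (check them with whos('-file', ...)), and since Matlab's kmeans does not compute BIC itself, the sketch uses a common spherical-Gaussian approximation of the criterion.

    % Load the data; the variable names "instances" and "labels" are assumptions.
    load('pregVSlabor_instances.mat');        % assumed: a 60x4 matrix "instances"
    load('pregVSlabor_labels.mat');           % assumed: a 60x1 vector "labels"
    X = instances(:, [1 2]);                  % keep the two most informative features (adjust the indices)
    figure; scatter(X(:,1), X(:,2), 25, labels, 'filled');    % true classes as colors

    [n, p] = size(X);
    ks = 1:8;
    bic = zeros(size(ks));
    for i = 1:numel(ks)
        [idx, ~, sumd] = kmeans(X, ks(i), 'Replicates', 5);
        figure; scatter(X(:,1), X(:,2), 25, idx, 'filled');   % cluster indices as colors
        W = sum(sumd);                        % total within-cluster sum of squares
        bic(i) = n*log(W/n) + p*ks(i)*log(n); % spherical-Gaussian BIC approximation
    end
    [~, kbest] = min(bic);
    fprintf('BIC selects k = %d clusters\n', ks(kbest));

For the indirect evaluation of step 5, cross-tabulating cluster indices against the true labels (for instance with crosstab(idx, labels)) shows how well the clusters recover the classes.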

Exercise 2: KNN and decision trees, Part 1


In this exercise, we will use the datasets stored in data*ex.mat (where * stands for the number of examples: 150, 500, 1000, or 6000). These datasets contain examples drawn from the same distribution: each example has a 2D feature representation stored in the first two columns of the data matrix, and a binary class label stored in the third column. The various datasets will serve as training sets of different sizes, in order to evaluate the influence of the number of examples on the generalization error of the learning algorithms. A test set is available in the file datatest.mat. For the needs of the exercise, the test set contains a large number of examples, so that it can serve as a reliable estimate of the generalization error of the algorithms. Matlab sketches for Questions 1 to 4 are given at the end of this exercise.

Question 1: a first KNN classifier

1. Using the function ClassificationKNN.fit, build a KNN classifier using data500ex.mat as training data.

2. Using the Matlab functions crossval and kfoldLoss, compute the cross-validation error of the KNN classifier for different choices of the number of neighbors (see the NumNeighbors property of Matlab's KNN classifier to change the number of neighbors). How many neighbors do you choose? What is the cross-validation error of this classifier?

3. Using the Matlab function loss and the test data, compute the test error of this classifier.

Question 2: verifying the relevance of the cross-validation heuristic

1. The cross-validation heuristic is widely used in practice to select the hyperparameters of machine learning algorithms; in this question we will verify that the number of neighbors chosen by cross-validation is reasonable compared to the optimal number of neighbors (i.e. the one we would choose to perform best on the test data). To that end, for different numbers of neighbors, compute the cross-validation error and the test error of the KNN classifier. What is the optimal number of neighbors on the test data? What is the corresponding classification error? Compare to what was obtained by cross-validation.

Question 3: a first decision tree

1. Using the function ClassificationTree.fit, train a decision tree on the same training data as before. In order to grow a tree as large as possible before pruning, set the MinParent argument of ClassificationTree.fit to 1.

2. Matlab decision trees have a method called cvLoss. Using this function with the option subtrees set to 'all', compute the cross-validation error of all the different prunings of the tree. Using the best level of pruning (returned by cvLoss) and the Matlab function prune, build the optimal tree by cross-validation.

3. What is the cross-validation error of the obtained tree? Compare it to that of the KNN classifier built before: which classifier would you choose on this dataset?

4. Compare the test error of the decision tree and of the KNN classifier. Verify that the choice between the two algorithms made using the cross-validation error is correct.

5. By plotting the data and looking at the decision boundary, can you tell why one algorithm is more appropriate than the other on this dataset?

Question 4: varying the number of examples

1. Using the datasets data*ex.mat (where * stands for the number of examples), build KNN classifiers with an increasing number of training examples. Choose the number of neighbors for each training dataset by cross-validation, and plot the generalization error as a function of the number of training examples. How does the generalization error evolve with the size of the training set? Is this evolution expected?
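The following sketches cover Questions 1 to 4. Throughout, the matrix stored in each .mat file is assumed to be named data (an assumption; adapt to the real variable name). For Questions 1 and 2, one loop computes both the cross-validation error and the test error over a range of k, which answers the selection in Question 1 and the comparison in Question 2.

    S = load('data500ex.mat');                % assumed variable name: data
    Xtrain = S.data(:, 1:2);  Ytrain = S.data(:, 3);
    T = load('datatest.mat');                 % assumed variable name: data
    Xtest  = T.data(:, 1:2);  Ytest  = T.data(:, 3);

    ks = 1:2:51;
    cverr  = zeros(size(ks));
    tsterr = zeros(size(ks));
    for i = 1:numel(ks)
        knn = ClassificationKNN.fit(Xtrain, Ytrain, 'NumNeighbors', ks(i));
        cverr(i)  = kfoldLoss(crossval(knn)); % 10-fold cross-validation by default
        tsterr(i) = loss(knn, Xtest, Ytest);  % test error
    end
    figure; plot(ks, cverr, '-o', ks, tsterr, '-s');
    legend('cross-validation error', 'test error');
    xlabel('number of neighbors k'); ylabel('classification error');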
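For Question 3, a sketch that grows a large tree, prunes it at the level selected by cross-validation, and computes the two errors to compare with KNN:

    tree = ClassificationTree.fit(Xtrain, Ytrain, 'MinParent', 1);  % grow a large tree
    [E, SE, Nleaf, bestLevel] = cvLoss(tree, 'Subtrees', 'all');    % CV error of every pruning
    prunedTree = prune(tree, 'Level', bestLevel);                   % best pruning by CV
    cvTreeErr  = kfoldLoss(crossval(prunedTree));
    tstTreeErr = loss(prunedTree, Xtest, Ytest);
    view(prunedTree, 'Mode', 'graph');        % inspect the tree structure

For the decision boundary of Question 3.5, one option is to predict on a dense meshgrid covering the data range and color the grid points by predicted class.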
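For Question 4, a sketch of the learning curve; for each training-set size, k is chosen by cross-validation, and the error on the large test set (Xtest, Ytest loaded above) serves as the estimate of the generalization error.

    sizes  = [150 500 1000 6000];
    generr = zeros(size(sizes));
    for s = 1:numel(sizes)
        D = load(sprintf('data%dex.mat', sizes(s)));  % assumed variable name: data
        X = D.data(:, 1:2);  Y = D.data(:, 3);
        ks = 1:2:51;
        cv = zeros(size(ks));
        for i = 1:numel(ks)
            knn = ClassificationKNN.fit(X, Y, 'NumNeighbors', ks(i));
            cv(i) = kfoldLoss(crossval(knn));
        end
        [~, ibest] = min(cv);                 % number of neighbors chosen by CV
        knn = ClassificationKNN.fit(X, Y, 'NumNeighbors', ks(ibest));
        generr(s) = loss(knn, Xtest, Ytest);
    end
    figure; plot(sizes, generr, '-o');
    xlabel('number of training examples'); ylabel('test error');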

Exercise 3: KNN and decision trees, Part 2


In this exercise, you will use the training data contained in dataset2.mat, and the test data contained in dataset2_test.mat. In these datasets, the instances are represented with 12 features (stored in the first 12 columns of the data matrices), and have binary labels (stored in the last column of the data matrices).

1. Build a KNN classifier and a decision tree using dataset2.mat. Choose the hyperparameters by cross-validation. Which algorithm seems best on this dataset? Verify your choice using the test set.

2. By looking at the decision tree obtained (using Matlab's view function), or at the most important features for the decision tree (using the predictorImportance method of Matlab decision trees), can you tell why there is a large difference between the two algorithms' performances? Can you find a way to improve KNN's performance? A sketch is given below.
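A sketch for this exercise, again assuming the variable inside each .mat file is named data. One plausible way to help KNN, suggested by the feature importances, is to standardize the features (or keep only the important ones) so that irrelevant or large-scale features stop dominating the Euclidean distance; this is a hypothesis to verify, not the unique answer.

    S = load('dataset2.mat');                 % assumed variable name: data
    X = S.data(:, 1:12);  Y = S.data(:, 13);
    T = load('dataset2_test.mat');            % assumed variable name: data
    Xtest = T.data(:, 1:12);  Ytest = T.data(:, 13);

    tree = ClassificationTree.fit(X, Y);
    imp  = predictorImportance(tree);         % importance of each of the 12 features
    figure; bar(imp); xlabel('feature'); ylabel('importance');

    % Standardize with the training statistics so the Euclidean distance
    % is not dominated by features with large scales.
    mu = mean(X);  sigma = std(X);
    Xz  = bsxfun(@rdivide, bsxfun(@minus, X,     mu), sigma);
    Xzt = bsxfun(@rdivide, bsxfun(@minus, Xtest, mu), sigma);
    knn = ClassificationKNN.fit(Xz, Y, 'NumNeighbors', 5);   % tune k by CV as before
    fprintf('KNN test error after standardization: %.3f\n', loss(knn, Xzt, Ytest));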
