These exercises are intended to be a first contact with clustering and classification algorithms, as well as with the Matlab Statistics Toolbox. In order to make the visualization and the interpretation of the results easier, we use low-dimensional datasets, but one must keep in mind that real-life data may be described by hundreds, thousands, or even hundreds of thousands of features.
2. Using the Matlab functions crossval and kfoldLoss, compute the cross-validation error of the KNN classifier for different choices of the number of neighbors (see the NumNeighbors property of Matlab's KNN classifier to change the number of neighbors). How many neighbors do you choose? What is the cross-validation error of this classifier?
3. Using the Matlab function loss and the test data, compute the test error of this classifier.

Question 2: verifying the relevance of the cross-validation heuristic
1. While the cross-validation heuristic is widely used in practice to select the hyperparameters of machine learning algorithms, in this question we verify that the number of neighbors chosen by cross-validation is reasonable compared to the optimal number of neighbors (i.e., the one we would choose to perform best on the test data). To that end, for different numbers of neighbors, compute both the cross-validation error and the test error of the KNN classifier. What is the optimal number of neighbors on the test data? What is the corresponding classification error? Compare with what was obtained by cross-validation.

Question 3: a first decision tree
1. Using the function ClassificationTree.fit, train a decision tree on the same training data as before. In order to grow a tree as large as possible before pruning, set the MinParent argument of ClassificationTree.fit to 1.
2. Matlab decision trees have a method called cvLoss. Using this function with the option subtrees set to all, compute the cross-validation error of all the different prunings of the tree. Using the best pruning level (returned by cvLoss) and the Matlab function prune, build the tree that is optimal according to cross-validation.
3. What is the cross-validation error of the obtained tree? Compare it to the KNN classifier built before: which classifier would you choose on this dataset?
4. Compare the test error of the decision tree and of the KNN classifier. Verify that the choice between the two algorithms made using the cross-validation error is correct.
5. By plotting the data and looking at the decision boundary, can you tell why one algorithm is more appropriate than the other on this dataset?

Question 4: varying the number of examples
1. Using the datasets data*ex.mat (where * stands for the number of examples), build KNN classifiers with an increasing number of training examples. Choose the number of neighbors for each training dataset by cross-validation, and plot the generalization error as a function of the number of training examples. How does the generalization error evolve with the size of the training set? Is this evolution expected?
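A minimal sketch of the neighbor-selection loop described above, assuming the training and test data have already been loaded into variables Xtrain/Ytrain and Xtest/Ytest (these variable names, and the candidate range of k, are illustrative):

```matlab
% Candidate numbers of neighbors (illustrative range)
ks = 1:2:21;
cvErr = zeros(size(ks));
for i = 1:numel(ks)
    knn = ClassificationKNN.fit(Xtrain, Ytrain, 'NumNeighbors', ks(i));
    cvmodel = crossval(knn);        % 10-fold cross-validation by default
    cvErr(i) = kfoldLoss(cvmodel);  % average misclassification rate over folds
end
[bestCvErr, idx] = min(cvErr);
bestK = ks(idx);
% Retrain with the selected number of neighbors and measure the test error
knn = ClassificationKNN.fit(Xtrain, Ytrain, 'NumNeighbors', bestK);
testErr = loss(knn, Xtest, Ytest);
```

The same cvErr vector, computed alongside the test error for each k, is what Question 2 asks you to compare.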
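The grow-then-prune procedure of Question 3 can be sketched as follows (again assuming Xtrain/Ytrain and Xtest/Ytest hold the data):

```matlab
% Grow a tree as large as possible before pruning
tree = ClassificationTree.fit(Xtrain, Ytrain, 'MinParent', 1);
% Cross-validation error of every pruning level; cvLoss also returns the
% best level according to cross-validation
[cvErrTree, ~, ~, bestLevel] = cvLoss(tree, 'subtrees', 'all');
% Build the pruned tree selected by cross-validation
prunedTree = prune(tree, 'Level', bestLevel);
treeTestErr = loss(prunedTree, Xtest, Ytest);
```

Comparing min(cvErrTree) with the KNN cross-validation error, and treeTestErr with the KNN test error, answers items 3 and 4.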
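For Question 4, one possible structure of the learning-curve loop, assuming the files data*ex.mat each contain variables named Xtrain/Ytrain (the dataset sizes listed below are placeholders; use the sizes actually provided):

```matlab
ks = 1:2:21;                          % candidate numbers of neighbors
nExamples = [50 100 200 500 1000];    % placeholder sizes; match your data*ex.mat files
genErr = zeros(size(nExamples));
for i = 1:numel(nExamples)
    S = load(sprintf('data%dex.mat', nExamples(i)));  % assumed file naming
    % Select the number of neighbors by cross-validation, as in Question 1
    cvErr = zeros(size(ks));
    for j = 1:numel(ks)
        knn = ClassificationKNN.fit(S.Xtrain, S.Ytrain, 'NumNeighbors', ks(j));
        cvErr(j) = kfoldLoss(crossval(knn));
    end
    [~, idx] = min(cvErr);
    knn = ClassificationKNN.fit(S.Xtrain, S.Ytrain, 'NumNeighbors', ks(idx));
    genErr(i) = loss(knn, Xtest, Ytest);  % test set held fixed across sizes
end
plot(nExamples, genErr, '-o');
xlabel('number of training examples'); ylabel('test error');
```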
1. Build a KNN classifier and a decision tree using dataset2.mat. Choose the hyperparameters by cross-validation. Which algorithm seems best on this dataset? Verify your choice using the test set.
2. By looking at the decision tree obtained (using the Matlab view function), or by looking at the most important features for the decision tree (using the predictorImportance property of Matlab decision trees), can you tell why there is a large difference between the two algorithms' performances? Can you find a way to improve KNN's performance?
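One common way to improve KNN when the features have very different scales (which the tree is insensitive to, but which distorts KNN's distances) is to standardize each feature before training. A sketch, assuming Xtrain/Ytrain and Xtest/Ytest variables and a number of neighbors k chosen by cross-validation:

```matlab
% Standardize each feature to zero mean and unit variance on the training set
[XtrainN, mu, sigma] = zscore(Xtrain);
% Apply the *training* statistics to the test set
XtestN = bsxfun(@rdivide, bsxfun(@minus, Xtest, mu), sigma);
knn = ClassificationKNN.fit(XtrainN, Ytrain, 'NumNeighbors', k);
errNormalized = loss(knn, XtestN, Ytest);
```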