
UNSUPERVISED LEARNING

TOTAL MARKS: 70 DURATION: 4 HOURS

INSTRUCTIONS:
1. Candidates should answer all the questions in the same order provided in the question paper.
2. Any activity that compromises the integrity of the examination will not be permitted.
3. Students should complete the examination within the provided timeline.
4. Candidates are expected to check and ensure that the correct answer file (in .ipynb format) is uploaded
to the LMS.

Dataset Information:
The dataset reports TB prevalence, all forms (per 100,000 population per year), in different countries.
Group countries by how similar their year-by-year situation has been, to understand the world situation
regarding tuberculosis. The cluster information is given for reference only; please remove it
before building the models.

Note: State all assumptions made. If any sub-question cannot be done, please state
the reason.

1. Data Understanding (5 marks)


a. Read the dataset (tab, csv, xls, txt, inbuilt dataset). What are the number of rows, the number of
columns, and the types of variables (continuous, categorical, etc.)? (1 MARK)
b. Calculate the five-point summary for the numerical variables. (1 MARK)
c. Summarize observations for the categorical variables: number of categories and % of observations
in each category. (1 MARK)
d. Generate the covariance and correlation tables for the data. (1 MARK)
e. Create visualization plots to find the relationships among the variables. (1 MARK)
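
As an illustration of items b and d, a minimal sketch with pandas is given here. The DataFrame and its column names are hypothetical stand-ins, not the examination dataset; the five-point summary is taken as minimum, first quartile, median, third quartile and maximum.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 20 "countries" with three yearly columns.
# The column names are invented for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(10, 500, size=(20, 3)).astype(float),
    columns=["y1990", "y1991", "y1992"],
)

# Five-point summary: min, Q1, median, Q3, max for each numerical column.
five_point = df.describe().loc[["min", "25%", "50%", "75%", "max"]]

# Covariance and (Pearson) correlation tables.
cov_table = df.cov()
corr_table = df.corr()

print(five_point)
print(cov_table)
print(corr_table)
```

On the real data, the same three calls would be applied to the numerical columns after the reference cluster column has been dropped.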

2. Dimensionality Reduction (10 marks)


a. How will you decide when to apply PCA based on the correlation? (2 marks)
b. Apply PCA on the above dataset and determine the number of principal components needed so
that 95% of the variance in the data is explained. (8 marks)
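
One possible way to approach item b is sketched below with scikit-learn. The data are synthetic (an assumption made purely for illustration), and standardizing before PCA is a common choice rather than a requirement of the question.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the TB table: 50 "countries" x 18 "years",
# generated from 3 latent trends so that the columns are strongly correlated.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 3))
loadings = rng.normal(size=(3, 18))
X = latent @ loadings + 0.1 * rng.normal(size=(50, 18))

# Put every column on the same scale before PCA.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components asks PCA to keep the fewest components whose
# cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.95, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print("components kept:", pca.n_components_)
print("cumulative explained variance:", np.round(cumulative, 3))
```

Plotting `cumulative` against the component index gives the usual cumulative-variance curve, from which the 95% cut-off can also be read by eye.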

3. Clustering: Use PCA dimensions to cluster the data. Apply K-means and Agglomerative clustering.
(30 Marks)
Some pointers that may help, but do not be limited by these:
a. Find the optimal K Value. (5 marks)
b. Apply Clustering and find out if the data points have been clustered correctly using appropriate
visualization (20 marks)
c. Evaluate the clusters formed using appropriate metrics to support the model built and compare
both the models. (5 marks)
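
The pointers above can be sketched as follows. This is only an illustration under stated assumptions: the points come from `make_blobs` rather than from the PCA scores of the TB data, K is chosen by the highest silhouette score (the elbow of the inertia curve is another common choice), and Ward linkage is one of several possible linkages.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the PCA scores: 150 points in 2-D.
X_pca, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)

# Record inertia (for an elbow plot) and silhouette score for each candidate K.
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_pca)
    scores[k] = (km.inertia_, silhouette_score(X_pca, km.labels_))

# Pick the K with the highest silhouette score.
best_k = max(scores, key=lambda k: scores[k][1])

# Fit both models with the chosen K and compare them on the same metric.
km_labels = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X_pca)
agg_labels = AgglomerativeClustering(n_clusters=best_k, linkage="ward").fit_predict(X_pca)

km_sil = silhouette_score(X_pca, km_labels)
agg_sil = silhouette_score(X_pca, agg_labels)
print("best K:", best_k)
print("K-means silhouette:", round(km_sil, 3))
print("Agglomerative silhouette:", round(agg_sil, 3))
```

A scatter plot of the first two PCA dimensions coloured by `km_labels` and by `agg_labels` is a simple visual check for item b.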

4. Use the cluster labels from the best method above and convert the problem to a supervised learning
classification. (15 marks)
a. Split dataset into train and test (70:30) (2 marks)
b. Are both train and test representative of the overall data? How would you ascertain this
statistically? (3 marks)
c. For a supervised machine learning problem, how will you decide when to apply
PCA, and how will you improve the accuracy of the model? State clearly the changes that
you will make before re-fitting the model. Fit the final model. Feel free to run any number
of iterations to reach the final answer. Marks are awarded based on the quality of the final model
you are able to achieve. (10 marks)
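
For items a and b, one possible approach is sketched below. The features and labels are randomly generated stand-ins, and the choice of tests is an assumption: a two-sample Kolmogorov-Smirnov test per feature and a chi-square test on the label counts. Other tests would be equally defensible.

```python
import numpy as np
from scipy.stats import chi2_contingency, ks_2samp
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: 200 rows, 4 features, 3 cluster labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 3, size=200)

# 70:30 split, stratified so label proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

# Per-feature KS test: a large p-value gives no evidence that the
# train and test distributions of that feature differ.
ks_pvalues = [ks_2samp(X_train[:, j], X_test[:, j]).pvalue for j in range(X.shape[1])]

# Chi-square test on the 2 x 3 table of label counts (train row, test row).
counts = np.array([
    np.bincount(y_train, minlength=3),
    np.bincount(y_test, minlength=3),
])
chi2, p_labels, dof, _ = chi2_contingency(counts)

print("KS p-values per feature:", np.round(ks_pvalues, 3))
print("chi-square p-value for labels:", round(p_labels, 3))
```

Large p-values are consistent with the two subsets being representative of the overall data; they do not prove it.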

5. Summarize as follows (10 marks)


a. Summarize the overall fit of the model. Compare all the clustering and classification models built
and list the measures that show it is a good model.
b. Write down a business interpretation/explanation of the model.
c. Which variables affect the target the most? Explain the relationship. Feel free to use
charts or graphs to explain.
d. What changes from the base model had the most effect on model performance?
e. What are the key risks to your results and interpretation?
