In this optional practice assignment, you can try your hand at applying PCA to a supervised classification problem. **You will not get any feedback on this assignment.**
You will use the popular MNIST handwritten digits dataset and build a classifier to predict the label of a given digit.
Dataset
The dataset consists of images of handwritten numeric digits between 0-9. Each image is of 28 x 28 pixels, i.e. 28 pixels along both length and breadth of the image. Each pixel is an attribute with a numeric value (representing the intensity of the pixel), and thus, there are 784 attributes in the dataset.
You can download the dataset from Kaggle here.
Problem Statement
The task is to build a classifier that predicts the label of an image (a digit between 0-9) given the features. Thus, this is a 10-class classification problem. Since this dataset has a large number of features (784), you can use PCA to reduce the dimensionality, and then, build a model on the low-dimensional data.
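One common way to pick a reasonable dimensionality before building any classifier is to inspect PCA's cumulative explained variance. The sketch below uses sklearn's bundled `load_digits` (8×8 images, 64 features) as a small offline stand-in for MNIST; the same approach carries over directly to the 784-feature Kaggle data.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# load_digits is a small stand-in for MNIST; swap in the Kaggle data for the assignment
X, y = load_digits(return_X_y=True)

# Scale features before PCA so no single pixel dominates purely by variance
X_scaled = StandardScaler().fit_transform(X)

pca = PCA().fit(X_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that retains ~95% of the total variance
k = int(np.argmax(cum_var >= 0.95)) + 1
print(f"{k} components retain 95% of the variance")
```

Plotting `cum_var` against the component index (the "scree" curve) gives the same information visually.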
A suggested workflow is outlined below:
- Classifier without PCA: First try to build a model on the original 784 attributes. You may try using logistic regression, random forests, and so on.
- Tune the hyperparameters of your model using an appropriate method. Which evaluation metric would you choose to measure the model performance?
- Which model would perform the best?
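A baseline without PCA can be as simple as the sketch below: logistic regression on all raw pixel features, scored with cross-validated accuracy (a reasonable metric here since the ten digit classes are roughly balanced). `load_digits` is used as a small offline stand-in for the Kaggle MNIST data.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in for MNIST; replace with the Kaggle train.csv features/labels
X, y = load_digits(return_X_y=True)

# Baseline classifier on all original features, no PCA
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"5-fold CV accuracy: {scores.mean():.3f}")
```

You could swap `LogisticRegression` for `RandomForestClassifier` or another model and compare the cross-validated scores to decide which classifier to carry into the PCA step.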
- Classifier with PCA: Now, try to reduce the dimensionality of the dataset from 784 to a lower number k, and build a model of your choice (ideally the best performer from step 1 above) on the lower-dimensional data. Here, you’ll need to think about the following:
- How would you find the optimal value of k? Experiment with several values of k: build and tune the classifier for each, and compare model performance across them.
- Notice that you are now treating k as a ‘model’ hyperparameter along with the classifier’s hyperparameters: your ‘composite model’ has two models in sequence, PCA followed by a classifier.
- How would you tune the value of k and the classifier’s hyperparameters simultaneously?
- Hint: Try using the Pipeline feature of sklearn to chain the two models and tune the hyperparameters using GridSearchCV.
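The hint above can be sketched as follows: chain scaling, PCA, and a classifier into a `Pipeline`, then let `GridSearchCV` tune `n_components` (i.e. k) jointly with the classifier's hyperparameters. As before, `load_digits` stands in for MNIST, and the candidate grid values are illustrative only.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in for MNIST; replace with the Kaggle data for the assignment
X, y = load_digits(return_X_y=True)

# Chain scaling -> PCA -> classifier so k is tuned jointly with the
# classifier's hyperparameters in a single cross-validated search
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "pca__n_components": [10, 20, 30, 40],  # candidate values of k
    "clf__C": [0.1, 1.0, 10.0],             # classifier regularization strength
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Note the `step__parameter` naming convention (`pca__n_components`, `clf__C`), which is how `GridSearchCV` addresses hyperparameters of individual pipeline steps.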
- You can use the following rubric to self-evaluate your solution.
Self-evaluation Rubric
| Criteria | Meets Expectations | Doesn’t Meet Expectations |
| --- | --- | --- |
| Data Understanding, Preparation, and EDA | Relevant data quality checks are performed, and all data quality issues are addressed correctly (missing value imputation, removal of duplicate data, and other kinds of data redundancies, if any). Data is scaled appropriately before applying PCA. | Not all quality checks are performed, and data quality issues are not addressed adequately. Data is not scaled appropriately. |
| Model Building and Evaluation | Model parameters are tuned using correct principles, and the approach is explained clearly. Model evaluation is done using the correct principles, and appropriate evaluation metrics are chosen. Optimal hyperparameters are correctly chosen using cross-validation. Pipelining of PCA and the classifier is done correctly, and the optimal ‘n_components’ and hyperparameters are chosen. The results are on par with the best possible model on the dataset. | Model parameters are not tuned using the correct principles, and the approach is not explained clearly. Model evaluation is not done using the correct principles, and appropriate evaluation metrics are not chosen. Optimal hyperparameters are chosen incorrectly (e.g. using incorrect metrics, or evaluating on training data rather than test data). The results are not on par with the best possible model on the dataset. |
| Conciseness and Readability of Code | The code is concise and syntactically correct. Wherever appropriate, built-in functions and standard libraries are used instead of writing long code snippets (containing if-else statements, for loops, and so on). Custom functions are used to perform repetitive tasks. The code is readable with appropriately named variables, and detailed comments are provided wherever necessary. | Long, complex code snippets are used instead of shorter built-in functions. Custom functions are not used to perform repetitive tasks, which results in the repeated execution of the same piece of code. Code readability is poor because of vaguely named variables or a lack of comments where they are necessary. |