Since you are now familiar with the data set, the next important step in the model-building process is the EDA of the data set. As seen in the previous segment, the data set has many columns, not all of which can be used as features for model training. So, in the upcoming video, Ajay and Amit will take you through the EDA and the feature selection process.
Note:
There are various reasons to perform feature selection: it makes the model more generalisable and prevents it from learning noise, it saves computational resources, and it helps maintain the necessary independence between features.
As Ajay says, the model can be trained only on numeric features, as you cannot infer anything directly from the string features. So, from the given set of features, you can select only those that are integers and create a new DataFrame with the selected features.
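The video does not reproduce the exact code here; a minimal sketch of this selection step in pandas, using a small made-up DataFrame (the column names and values are illustrative, not from the actual data set), could look like this:

```python
import pandas as pd

# Hypothetical sample resembling the data set: a mix of numeric and string columns.
df = pd.DataFrame({
    "C1": [1005, 1002, 1005],             # numeric feature
    "C14": [21, 35, 48],                  # numeric feature
    "site_id": ["a1b2", "c3d4", "a1b2"],  # string feature, to be dropped
})

# Keep only the numeric columns in a new DataFrame; string columns are filtered out.
numeric_df = df.select_dtypes(include="number")
print(list(numeric_df.columns))  # -> ['C1', 'C14']
```

`select_dtypes` returns a new DataFrame, so the original `df` is left untouched.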
After removing the string features, you need to perform one-hot encoding on the selected features.
One-Hot Encoding
One-hot encoding creates K columns for K categories. Each category is represented by a vector of 0s in which exactly one position holds the ‘hot’ value ‘1’. For example, the table given below shows various types of fruit and the calories they contain.
After one-hot encoding, the same table will look like the following:
A column becomes hot only when its category is present, hence the name one-hot encoding. Encoding variables this way wastes a lot of space, as it creates additional columns in the DataFrame: the larger the number of categories in a feature, the more columns are created and the sparser the resulting matrix becomes. So, you need to remove the columns that have a large number of distinct values. Let’s watch the next video to see how this can be done.
To ascertain the number of unique values in each column, Ajay used the following code:
The output of the above code is as follows:
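The exact snippet from the video is not reproduced here; a minimal equivalent in pandas, assuming the selected features sit in a DataFrame named `X` (the name and values below are illustrative), would be:

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame of selected numeric features.
X = pd.DataFrame({
    "C1": [1005, 1005, 1002, 1005],
    "C14": [21, 35, 48, 60],
})

# Count the number of distinct values in every column.
unique_counts = X.nunique()
print(unique_counts)  # C1 -> 2, C14 -> 4
```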
Now, from the table shown earlier, you can see that the columns C14, C17, C19, C20 and C21 have a large number of distinct values, so you can ignore those columns. After this, you have the final DataFrame on which your model can be trained. However, before the training, you have to prepare the data. So, let’s go through the data preparation steps in the next segment.
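One way to sketch this dropping step, with a toy DataFrame and an illustrative cut-off (the threshold and column values are assumptions, not the ones from the video), is:

```python
import pandas as pd

# Hypothetical DataFrame; C20 plays the role of a high-cardinality column.
df = pd.DataFrame({
    "C1": [1, 1, 2, 1, 2],
    "C20": [100001, 100002, 100003, 100004, 100005],
})

# Drop columns whose number of distinct values exceeds a chosen threshold.
threshold = 3  # an illustrative cut-off
high_cardinality = [col for col in df.columns if df[col].nunique() > threshold]
final_df = df.drop(columns=high_cardinality)
print(list(final_df.columns))  # -> ['C1']
```

The same pattern applies to the actual data set: list the offending columns (C14, C17, C19, C20, C21) and pass them to `df.drop(columns=...)`.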