EDA and Feature Selection

Now that you are familiar with the data set, the next important step in the model building process is exploratory data analysis (EDA). Also, as seen in the previous segment, the given data set has many columns, and not all of them can be used as features for model training. So, in the upcoming video, Ajay will take you through EDA and the feature selection process.

As Ajay says, the models in this case are trained only on the integer features, since nothing can be inferred directly from the raw string features. So, from the given set of features, you can select only those that are integers and create a new DataFrame with the selected features.
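The selection step described above can be sketched as follows. This is a minimal illustration that assumes the data is held in a pandas DataFrame; the toy rows and the column names `click`, `site_id` and `banner_pos` are hypothetical stand-ins, not the actual data set or the code used in the video.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a mix of integer and string columns.
df = pd.DataFrame({
    "click": [0, 1, 0],
    "C1": [1005, 1002, 1005],
    "site_id": ["1fbe01fe", "85f751fd", "1fbe01fe"],  # string feature
    "banner_pos": [0, 1, 0],
})

# Keep only the columns whose dtype is an integer type.
int_df = df.select_dtypes(include=[np.integer])

print(list(int_df.columns))
```

Here `select_dtypes` filters on the column dtype, so a column that stores numbers as strings would also be dropped unless it is converted first.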

After removing the string features, you need to perform one-hot encoding on the selected features.

One-Hot Encoding

One-hot encoding creates K columns for K categories. Each category is represented by a vector that is all 0s except for a single ‘hot’ value of 1. For example, the table given below shows various types of fruit and the calories they contain.

Fruit	Category	Calories per 100 gm
Banana	1	89
Apple	2	52
Banana	1	89
Mango	3	60

After one-hot encoding, the same table will look like the following:

Apple	Banana	Mango	Calories per 100 gm
0	1	0	89
1	0	0	52
0	1	0	89
0	0	1	60
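The fruit example above can be reproduced with a short sketch. It assumes pandas and uses `pd.get_dummies`, which is one common way to one-hot encode a column; the course may use a different tool.

```python
import pandas as pd

# The fruit table from the example above.
fruits = pd.DataFrame({
    "Fruit": ["Banana", "Apple", "Banana", "Mango"],
    "Calories per 100 gm": [89, 52, 89, 60],
})

# One column per category; each row has a single 1 in the column of its fruit.
one_hot = pd.get_dummies(fruits["Fruit"], dtype=int)

# Put the numeric column back next to the encoded columns.
encoded = pd.concat([one_hot, fruits[["Calories per 100 gm"]]], axis=1)

print(encoded)
```

Three categories produce three new columns, so a feature with thousands of categories would produce thousands of mostly-zero columns.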

A column becomes hot only in the rows where its category is present, hence the name one-hot encoding. Encoding variables this way takes up a lot of space because it creates additional columns in the DataFrame. The more categories a feature has, the more columns are created and the sparser the resulting matrix becomes. So, you need to remove the columns that have a large number of unique categorical values. For that, you need to ascertain the number of distinct values present in each of your DataFrame columns. Let’s watch the next video to see how this can be done.

To ascertain the number of unique values, Ajay used the following code:
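The code from the video is not reproduced in this text. As a rough stand-in, and not Ajay's actual code, the sketch below shows one way to count distinct values per column, assuming a pandas DataFrame. The toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with a few integer columns.
df = pd.DataFrame({
    "C1": [1005, 1005, 1002, 1005],
    "C14": [15706, 15704, 18993, 21689],
    "banner_pos": [0, 0, 1, 0],
})

# Number of distinct values in each column, largest first.
distinct_counts = df.nunique().sort_values(ascending=False)

print(distinct_counts)
```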

Now, from the table shown earlier, you can see that the columns C14, C17, C19, C20 and C21 have a large number of distinct values. So, you can ignore those columns. After this, you have the final DataFrame on which your model can be trained. However, before training, you have to prepare the data. So, let’s go through the data preparation steps in the next segment.
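Dropping the high-cardinality columns can be sketched as below. This assumes pandas; the toy data and the cut-off `MAX_DISTINCT` are hypothetical and chosen only for illustration, since the text does not state the threshold used in the video.

```python
import pandas as pd

# Hypothetical toy data; C14 and C17 have many distinct values.
df = pd.DataFrame({
    "C1": [1005, 1005, 1002, 1005, 1002],
    "C14": [15706, 15704, 18993, 21689, 17654],
    "C17": [1722, 1722, 2161, 2480, 1800],
    "banner_pos": [0, 0, 1, 0, 1],
})

MAX_DISTINCT = 3  # hypothetical cut-off, not taken from the course

# Find the columns whose distinct-value count exceeds the cut-off.
counts = df.nunique()
high_cardinality = counts[counts > MAX_DISTINCT].index.tolist()

# Drop them to get the DataFrame used for training.
final_df = df.drop(columns=high_cardinality)

print(high_cardinality)
print(list(final_df.columns))
```

In practice the cut-off is a judgment call: a lower value keeps the encoded matrix small, while a higher value retains more information.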
