Now that you have completed the EDA and selected the features, let’s look at the data preparation steps. Let’s hear from Ajay on how this can be done.
As explained in this video, StringIndexer, which maps string categories to numeric indices, is not required in the present case because all the values are already integers.
Next, you have to perform one-hot encoding as your DataFrame has categorical variables. One-hot encoding can be performed using the following code:
One-hot encoding creates new columns with the names specified in ‘outputCols’ and appends them to the DataFrame.
As you know, model building in PySpark requires the input features as a single vector column. So, a vector has to be created from the selected columns. This can be achieved by using the following code:
After running the above code, a new column named ‘features’ is created and appended to the DataFrame. After defining the one-hot encoder and the vector assembler, you can put these steps into a pipeline. Calling fit() on the pipeline returns a fitted pipeline model, and calling transform() on that model gives the final DataFrame, as follows:
You have now obtained the DataFrame, from which you need only the ‘label’ and ‘features’ columns for model building. So, in the last step of data preparation, the final DataFrame is created by selecting the ‘label’ and ‘features’ columns.