Feature Transformer: Vector Assembler

After identifying the necessary attributes, you can assemble them into a single feature vector using the VectorAssembler() transformer. In the next video, you will learn how the VectorAssembler() feature transformer works.

A feature transformer takes the data stored in a data frame, transforms it, and returns the result as a new data frame. Generally, this transformation is done by appending one or more columns to the existing data frame. It can be summarised by the following simple sequence: DataFrame =[transform]=> DataFrame. Transformers are generally applied while preparing and processing data sets.

The VectorAssembler() feature transformer takes a set of individual feature columns as input and returns a data frame with a new column that holds, for each row, a single vector containing all of those column values. It is an extension of the Transformer class and supports the .transform() method.
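To make the idea concrete, here is a minimal plain-Python sketch of what assembling does. It is not the Spark implementation: the rows are ordinary dictionaries, and the function name assemble() and the column names age, income and clicks are hypothetical. In PySpark, the corresponding call would be VectorAssembler(inputCols=[...], outputCol="features").transform(df).

```python
def assemble(rows, input_cols, output_col="features"):
    """Return new rows, each with an extra column holding the
    selected input values combined into one vector (a list here)."""
    assembled = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are left unchanged
        new_row[output_col] = [float(row[col]) for col in input_cols]
        assembled.append(new_row)
    return assembled


# Hypothetical data: each dictionary stands in for one row of a data frame.
rows = [
    {"age": 25, "income": 42000, "clicks": 3},
    {"age": 40, "income": 88000, "clicks": 7},
]

result = assemble(rows, ["age", "income", "clicks"])
for row in result:
    print(row)
```

Note that the original columns are kept and the vector is appended as a new column, which matches the DataFrame =[transform]=> DataFrame pattern described above.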

In a data set, the values of different variables are often on different scales. Some variables have values in the units range, while other variables have values in the range of thousands. Such varying scales may lead to one feature dominating the others in the final model (for example, in clustering, which you will learn about later). Moreover, in the case of linear regression, it is difficult to judge feature importance from the coefficient values if the features are not on the same scale. Therefore, it is important to scale all these values into a single range. It is common to scale all the data variables to the range [0,1]. Scaling can be performed using feature transformers such as MinMaxScaler(), which rescales each feature to [0,1] by default, and MaxAbsScaler(), which divides each feature by its maximum absolute value so that the result lies in [-1,1].
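The arithmetic behind these two kinds of scaling can be sketched in plain Python as follows. This is an illustration of the formulas only, not Spark's code: the function names min_max_scale() and max_abs_scale() and the sample values are hypothetical, and the handling of constant or all-zero columns is a choice made for this sketch.

```python
def min_max_scale(values, lower=0.0, upper=1.0):
    """Rescale values linearly so the minimum maps to `lower`
    and the maximum maps to `upper`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to rescale, so use the midpoint
        # of the target range (an assumption made for this sketch).
        return [0.5 * (lower + upper) for _ in values]
    return [lower + (v - lo) / (hi - lo) * (upper - lower) for v in values]


def max_abs_scale(values):
    """Divide every value by the largest absolute value,
    so the result lies in [-1, 1] and zeros stay zero."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0.0 for _ in values]
    return [v / max_abs for v in values]


# Hypothetical feature columns on very different scales.
income = [20000, 50000, 80000]
change = [-4, 2, 8]

scaled_income = min_max_scale(income)
scaled_change = max_abs_scale(change)
print(scaled_income)
print(scaled_change)
```

After scaling, both columns have comparable magnitudes, so neither dominates simply because of its original units.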

Additional Reading

Normalisation of Data using Spark: Refer to this link to learn more about the various data normalisation techniques.

Feature Transformer: This link points to the Spark documentation, where you can learn about the various feature transformers available in the Spark MLlib library.
