In this segment, you will learn about a few data preparation steps that are usually performed before building a machine learning model. This exercise is meant to give you a quick overview of the MLlib API and help you become familiar with it.
An important note as you move through the course: you will be using the Spark ML library, which is a DataFrame-based API, unlike the MLlib library, which is an RDD-based API.
Note:
For the EMR cluster configuration for this module, please follow the same configuration used for the Introduction to Apache Spark module. Make sure that you are using the PySpark kernel when running the EMR notebooks.
For this exercise, we will be using an automobile data set with various features such as the number of CYLINDERS, DISPLACEMENT, HORSEPOWER, WEIGHT and ACCELERATION. The data contains numerous missing and garbage values that will be handled during one of the data preparation steps.
The Jupyter Notebook used for this exercise is attached below.
The first step of data preparation is interacting with Spark and reading the data into a DataFrame. You can do this by using the read() method and specifying the file type and the file path.
The read() method is available through the SparkSession class and supports various optional settings for indicating the header and schema. By setting the 'header' option to true, Spark treats the first row of the file as the column names and the remaining rows as samples. Similarly, by setting the 'inferSchema' option to true, Spark automatically infers the schema of the data set.
The columns with garbage values (MPG and HORSEPOWER) are auto-inferred as the data type string. You can cast them to type double by using the .cast() method. After casting, any entries containing garbage values or NA are converted to null.
In the next segment, you will learn how to handle this scenario.
Additional Resources:
Here is the link to the Spark documentation, where you can go through the various features provided by the Spark MLlib library.