So far, you have learnt about the features and different components of the Spark APIs. Now, in this segment, you will learn how to create a DataFrame from various data sources. Let us first understand how each file format can be loaded into a Spark DataFrame and then saved back into different file formats.
The users.csv dataset and the Jupyter Notebook used in this session are as follows:
Note
Upload the dataset into the EMR instance and provide the complete path for the file location. If the file is present in the livy directory (/user/livy/), then you can simply mention the name of the file as can be seen in the videos.
We start by loading CSV files into a dataframe and then saving that dataframe in the JSON, Parquet and ORC file formats.
Note
Please note that in this module, you may sometimes see the kernel mentioned as Python 2 instead of PySpark. This is because some of these videos are older and the Python 2 kernel already had the PySpark libraries installed. For the current configuration of EMR, you will need to use the PySpark kernel only. The SME might also say EC2 instance instead of EMR instance, which is what applies in our case (at the most basic level, EMR instances make use of EC2 instances with additional configurations).
To load the file, it must be present in HDFS.
In the video above, you saw how Spark reads CSV files. The following syntax is used for loading a CSV file.
Example
spark.read.load("filename.fileformat", format = "fileformat", inferSchema = True, header = True)
Output
This syntax can be used for loading any file format. To load CSV files specifically, there is one more way.
Example
df = spark.read.csv("filename.csv", inferSchema = True, header = True)
Output
The inferSchema option here asks Spark to infer the schema on its own. If the inferred schema turns out to be incorrect, you can also define the data types manually using StructType, as shown in the video.
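For instance, here is a minimal sketch of specifying the schema manually with StructType; the column names user_id, name and age are only placeholder assumptions for the columns in users.csv:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# hypothetical schema for users.csv; replace the fields with the actual column names and types
schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# pass the schema explicitly instead of relying on inferSchema
df = spark.read.csv("filename.csv", schema = schema, header = True)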
Some methods used by the Adjunct Faculty in this video are:
df.show(): This command displays the entries of a dataframe in tabular form. The method takes two optional parameters: the number of rows to display and truncate. We will see their use in the upcoming segments.
df.printSchema(): This command shows different columns and their data types in the dataframe.
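As an illustration, here is a minimal sketch of how these two methods might be called on the dataframe loaded above; the arguments shown are only sample values:

df.show(5, truncate = False)    # display the first 5 rows without truncating long column values
df.printSchema()                # list each column along with its data type and nullability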
Now that we have learnt how to load data from CSV files and run some operations on it, let us understand how we can save the dataframe in different file formats.
For saving the dataframe in a JSON format, we use the following command:
Example
df.write.json("File_Name")
Output
For saving the dataframe in a parquet format, we use the following command:
Example
df.write.parquet("File_Name")
Output
For saving the dataframe in an ORC format, we use the following command:
Example
df.write.orc("File_Name")
Output
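As with reading, the generic save API can be used for any of these formats. The following is a minimal sketch, assuming the dataframe df created earlier; the directory names users_json, users_parquet and users_orc are only placeholders, and mode("overwrite") is included so the write does not fail if the directory already exists:

df.write.format("json").mode("overwrite").save("users_json")        # same result as df.write.json(...), with overwrite enabled
df.write.format("parquet").mode("overwrite").save("users_parquet")  # same result as df.write.parquet(...)
df.write.format("orc").mode("overwrite").save("users_orc")          # same result as df.write.orc(...)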
To verify how these file formats have been saved, our Adjunct Faculty shows the saved JSON file in the upcoming video.
So far, we have read the data from the CSV file, loaded it into a dataframe and saved it in various other formats. Let us now take the file we saved in the JSON format in the last video and load it into a dataframe.
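A minimal sketch of this read, assuming the JSON output was saved under the placeholder directory name File_Name used earlier:

df_json = spark.read.json("File_Name")   # Spark infers the schema from the JSON keys
df_json.show()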
In this video, we loaded data from the JSON format into a dataframe. Our SME performs some operations on this dataframe in the upcoming video.
To summarize, a JSON file stores data as key/value pairs. If we use the StructType method to change the name of a key, Spark treats it as a new key, and the resulting dataframe will have a new column containing only null values.
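To illustrate this behaviour, here is a minimal sketch under the assumption that the JSON file actually contains a key called name; the field username in the schema below is a deliberately mismatched, hypothetical key:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

wrong_schema = StructType([
    StructField("user_id", IntegerType(), True),
    StructField("username", StringType(), True)   # no such key in the file, so this column will be all nulls
])

df_json = spark.read.schema(wrong_schema).json("File_Name")
df_json.show()   # the username column contains only null values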
Our Adjunct Faculty, Vishwa Mohan, now discusses how to read Parquet and ORC files into dataframes.
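A minimal sketch of these reads, assuming the Parquet and ORC outputs were saved under the placeholder directory names used above:

df_parquet = spark.read.parquet("users_parquet")   # Parquet files carry their own schema,
df_orc = spark.read.orc("users_orc")               # so no inferSchema or header options are needed
df_parquet.printSchema()
df_orc.printSchema()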
So far, we have seen how to load different file formats into dataframes, perform some operations on them, and save them in various file formats. In the upcoming segments, we will see how to load Hive tables into dataframes.