In the last segment, we saw how to run queries using Spark's DataFrame API. Let us now run the same queries on the same dataset, but using Spark SQL. We are using 'file2' from our Spark DataFrame Jupyter notebook, in which all the date-time fields are in timestamp format. Let us hear from Vishwa Mohan about Spark SQL functionality.
Note
Please note that in this module, you may sometimes see the kernel listed as Python 2 instead of PySpark. This is because some of these videos are older, and the Python 2 kernel already had the PySpark libraries installed. For the current configuration of EMR, you will need to use the PySpark kernel only. The SME may also say EC2 instance instead of EMR instance, which is what applies in our case (at the most basic level, EMR instances are built on EC2 instances with additional configuration).
As we have seen in this video, Spark SQL functionality is useful for querying a dataset. We can run SQL queries directly by registering the DataFrame as a temporary table.
We have now discussed both DataFrames and Spark SQL in PySpark.
Note
The Jupyter notebook used in this segment is attached in the previous segment.