We have already discussed how to create DataFrames. Let’s take a look at the next video to understand some basic operations in the DataFrame API.
Note
Please note that in this module, you may sometimes see the kernel mentioned as Python 2 instead of PySpark. This is because some of these videos are older and the Python 2 kernel already had the PySpark libraries installed. For the current configuration of EMR, you will need to use the PySpark kernel only. The SME might also say EC2 instance where EMR instance is meant in our case (at the most basic level, EMR instances are EC2 instances with additional configuration).
Filter Command
The filter() command can be called on a DataFrame directly. As an argument to the filter method, we pass the column name and the condition to apply on that column. There are multiple ways to invoke the filter command.
The argument to the filter command can also be given in the following SQL syntax:
Example
```python
df.filter("column > 500").show()
```
It can also be called using a column type object.
Example
```python
df.filter(df['column'] > 500).show()
```
The filter command can handle multiple filter conditions. You can provide the conditions in a single command using logical operators, or spread them across several filter calls. Spark handles both forms in the same way as long as no action runs between the filters, because it combines the filters internally before executing them. This is possible for two reasons: lazy evaluation and the Catalyst optimiser.
DataFrames are also immutable: any operation that changes a DataFrame produces a new DataFrame rather than modifying the original.
Other operations discussed in the video include:
- select(): Just like the SQL SELECT clause, it is used to choose which columns appear in the output.
- groupBy(): As in Spark SQL, groupBy() groups the rows of a table by the values in a column, typically so that some aggregation can be performed on the grouped data.
- orderBy(): orderBy() arranges the rows in a particular order, either ascending or descending.
We have discussed some basic techniques to create DataFrames and some operations on the data. Let’s now run a few queries on some datasets in the upcoming segments.