In this segment, you will learn about some Spark SQL concepts, such as window functions, partitioning and row_number(), and understand how to use them. These concepts will help you while working on the project.
row_number() – Spark SQL
During the creation of dimension tables, you will have to add a row number/index to each table so that a primary key can be established.
Spark SQL provides a function called row_number() as part of its window functions, which can be used to assign a sequential integer to each row in a DataFrame within each partition defined by the window.
Let’s assume that you have a dataset containing the employee information of a certain company, stored in a DataFrame called emp_df with the following data:
| emp_name | dept    | salary |
|----------|---------|--------|
| Abhishek | Sales   | 1000   |
| Kumar    | Legal   | 2000   |
| Saif     | Finance | 1200   |
| Singh    | HR      | 1500   |
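If you want to follow along, here is a minimal sketch of how this sample DataFrame could be created (the SparkSession setup and the literal values are assumptions based on the table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("row_number_demo").getOrCreate()

# Illustrative sample data matching the table above
emp_df = spark.createDataFrame(
    [("Abhishek", "Sales", 1000),
     ("Kumar", "Legal", 2000),
     ("Saif", "Finance", 1200),
     ("Singh", "HR", 1500)],
    ["emp_name", "dept", "salary"],
)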
You can use the following code to add another column, having sequential numbers as values, to the DataFrame:
Example
from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Window over the entire DataFrame (no partition), ordered by emp_name
res_df = emp_df.select(
    "emp_name", "dept", "salary",
    F.row_number().over(Window.partitionBy().orderBy(emp_df["emp_name"])).alias("index"),
)
Output
This code snippet adds another column, called index, at the end of the DataFrame; it holds sequential integer values. The final DataFrame would look as follows:
| emp_name | dept    | salary | index |
|----------|---------|--------|-------|
| Abhishek | Sales   | 1000   | 1     |
| Kumar    | Legal   | 2000   | 2     |
| Saif     | Finance | 1200   | 3     |
| Singh    | HR      | 1500   | 4     |
Using this method, you will be able to generate a unique index value for each row of your dimension tables.
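Note that the example above uses an empty partitionBy(), so the whole DataFrame is numbered as a single partition. If you instead partition the window, row_number() restarts at 1 within each group. Here is a minimal sketch, assuming the same emp_df and imports as above (the per-department ordering by salary is purely illustrative):

# Number rows within each department, ordered by salary; the count restarts per dept
dept_window = Window.partitionBy("dept").orderBy("salary")
dept_indexed_df = emp_df.withColumn("dept_index", F.row_number().over(dept_window))

For the dimension-table use case, however, the empty partitionBy() shown earlier is what you need, since the index must be unique across the entire table.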
The concepts learned in this segment will help you during the creation of the dimension and fact tables.