

In this segment, you will learn some Spark SQL concepts, such as window functions, partitions and row_number(), and you will understand how to use them. These concepts will help you while working on the project.

row_number() – Spark SQL

During the creation of Dimension tables, you will have to add a row number/index to each table so that a primary key can be established.

Spark SQL provides a function called row_number() as part of the window functions group, which can be used to assign a sequential integer to each row in a DataFrame within each partition you define.

Let’s assume that you have a dataset of employee information for a company. You have a DataFrame called emp_df with the following data:

emp_name | dept    | salary
Abhishek | Sales   | 1000
Kumar    | Legal   | 2000
Saif     | Finance | 1200
Singh    | HR      | 1500

You can use the following code to add another column to the DataFrame, holding sequential numbers as values:

Example

Python
from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Order the whole DataFrame by emp_name (no partitioning column),
# then number the rows sequentially starting from 1
res_df = emp_df.select(
    "emp_name", "dept", "salary",
    F.row_number().over(Window.partitionBy().orderBy("emp_name")).alias("index")
)

Output

This code snippet will add a column called index at the end of the table, holding sequential integer values. The final DataFrame would look as follows:

emp_name | dept    | salary | index
Abhishek | Sales   | 1000   | 1
Kumar    | Legal   | 2000   | 2
Saif     | Finance | 1200   | 3
Singh    | HR      | 1500   | 4

Using this method, you will be able to create a unique value for each row of your dimension tables.

The concepts learned in this segment will help you during the creation of the dimensions and fact tables.
