In this segment, you will learn about some Spark SQL concepts, such as window functions, partitioning and row_number(), and understand how to use them. These concepts will help you while working on the project.
row_number() – Spark SQL
During the creation of dimension tables, you will have to add a row number/index to each table so that a primary key can be established.
Spark SQL provides a function called row_number() as part of its window functions, which can be used to assign a sequential integer to each row in a DataFrame within each partition defined by the window.
Let’s assume that you have a dataset containing the employee information of a certain company, stored in a DataFrame called emp_df with the following data:
| emp_name | dept    | salary |
|----------|---------|--------|
| Abhishek | Sales   | 1000   |
| Kumar    | Legal   | 2000   |
| Saif     | Finance | 1200   |
| Singh    | HR      | 1500   |
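If you want to follow along, here is a minimal sketch of how this sample DataFrame could be created (the SparkSession setup and the literal values are assumptions based on the table above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("row_number_demo").getOrCreate()

# Illustrative sample data matching the table above
emp_df = spark.createDataFrame(
    [("Abhishek", "Sales", 1000),
     ("Kumar", "Legal", 2000),
     ("Saif", "Finance", 1200),
     ("Singh", "HR", 1500)],
    ["emp_name", "dept", "salary"],
)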
You can use the following code to add another column, having sequential numbers as values, to the DataFrame:
Example
from pyspark.sql.window import Window
import pyspark.sql.functions as F

# Window over the entire DataFrame (no partition), ordered by emp_name
res_df = emp_df.select(
    "emp_name", "dept", "salary",
    F.row_number().over(Window.partitionBy().orderBy(emp_df["emp_name"])).alias("index"),
)
Output
This code snippet adds another column, called index, at the end of the DataFrame; it holds sequential integer values. The final DataFrame would look as follows:
| emp_name | dept    | salary | index |
|----------|---------|--------|-------|
| Abhishek | Sales   | 1000   | 1     |
| Kumar    | Legal   | 2000   | 2     |
| Saif     | Finance | 1200   | 3     |
| Singh    | HR      | 1500   | 4     |
Using this method, you will be able to generate a unique index value for each row of your dimension tables.
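Note that the example above uses an empty partitionBy(), so the whole DataFrame is numbered as a single partition. If you instead partition the window, row_number() restarts at 1 within each group. Here is a minimal sketch, assuming the same emp_df and imports as above (the per-department ordering by salary is purely illustrative):

# Number rows within each department, ordered by salary; the count restarts per dept
dept_window = Window.partitionBy("dept").orderBy("salary")
dept_indexed_df = emp_df.withColumn("dept_index", F.row_number().over(dept_window))

For the dimension-table use case, however, the empty partitionBy() shown earlier is what you need, since the index must be unique across the entire table.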
The concepts learned in this segment will help you during the creation of the dimension and fact tables.