In the next video, our expert will walk you through the entire Solution Approach for this project.
As discussed in the video, the tasks to be performed for this project are as follows:
- Data ingestion from RDS to HDFS using Sqoop.
- Sqoop import command.
- Command to see the list of imported data in HDFS.
For this step, the RDS connection string and credentials are as follows (a sample import command is sketched after this list):
- RDS Connection String –
- Username – student
- Password – STUDENT123
- Table Name – SRC_ATM_TRANS
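A minimal sketch of the Sqoop import and verification commands described above, wrapped in Python subprocess calls so the examples stay in one language. The JDBC connection string and the target HDFS directory are placeholders and assumptions; the username, password and table name are the ones listed above.

```python
import subprocess

# Placeholder JDBC URL -- substitute the RDS connection string provided for the project.
jdbc_url = "jdbc:mysql://<rds-endpoint>:3306/<database>"

# Sqoop import: pull the SRC_ATM_TRANS table from RDS into HDFS.
subprocess.run([
    "sqoop", "import",
    "--connect", jdbc_url,
    "--username", "student",
    "--password", "STUDENT123",
    "--table", "SRC_ATM_TRANS",
    "--target-dir", "/user/ec2-user/atm_trans",  # assumed HDFS directory
    "--fields-terminated-by", ",",
    "--num-mappers", "1",
], check=True)

# Verify the import by listing the files written to HDFS.
subprocess.run(["hadoop", "fs", "-ls", "/user/ec2-user/atm_trans"], check=True)
```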
PySpark
- Reading the data from the files in HDFS with a specific schema using PySpark.
- Command to create an input schema using StructType (we recommend creating a custom schema with the StructType class of PySpark to avoid any data type mismatch).
- Commands to read the data using the input schema created and to verify the data using the count function.
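A minimal sketch of the schema definition and read step, assuming the Sqoop output is comma-delimited text under the HDFS directory used above. The column names and types shown are an illustrative subset; the full schema must cover every field of SRC_ATM_TRANS.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("atm_etl").getOrCreate()

# Illustrative subset of the source columns -- extend it to every field in
# SRC_ATM_TRANS so there is no data type mismatch on read.
input_schema = StructType([
    StructField("year", IntegerType(), True),
    StructField("month", StringType(), True),
    StructField("day", IntegerType(), True),
    StructField("weekday", StringType(), True),
    StructField("hour", IntegerType(), True),
    StructField("atm_status", StringType(), True),
    StructField("atm_id", StringType(), True),
    StructField("atm_manufacturer", StringType(), True),
    StructField("atm_location", StringType(), True),
    StructField("transaction_amount", IntegerType(), True),
])

# Read the imported files using the custom schema and verify the record count.
df = spark.read.csv("/user/ec2-user/atm_trans", schema=input_schema, header=False)
print(df.count())
```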
- Creation of dimension tables using PySpark.
- Command to create a data frame for the dimension according to the target schema (dimension model) provided.
- Commands to clean and transform the data (an illustrative PySpark sketch follows the note below):
- Making sure that duplicate records are removed (avoid duplicate information, especially in the dimension tables).
- Making sure that appropriate primary keys are present for the dimensions (you need to generate a primary key for each dimension table; for example, for the ‘Date‘ dimension, one way to generate the primary key is to add “date” as a prefix to the row number, i.e. ‘date1’, ‘date2’ and so on).
- Rearranging the fields if necessary (according to the target schema).
Note
Here, the tasks given above have to be done for all four dimension tables.
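As an illustration of the cleaning and primary-key steps above, here is a sketch for one dimension (the date dimension), built from the DataFrame `df` read in the previous sketch. The column list and the `date_id` key name are assumptions; follow the target dimension model provided with the project.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Select the dimension attributes and drop duplicate rows.
dim_date = df.select("year", "month", "day", "weekday", "hour").dropDuplicates()

# Generate a surrogate primary key such as 'date1', 'date2', ... via a row number.
w = Window.orderBy("year", "month", "day", "hour")
dim_date = (
    dim_date
    .withColumn("date_id", F.concat(F.lit("date"), F.row_number().over(w).cast("string")))
    # Rearrange the fields so the key comes first, matching the target schema.
    .select("date_id", "year", "month", "day", "weekday", "hour")
)
```

The other three dimensions follow the same pattern with their own attribute columns and key prefixes.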
- Creation of transaction fact table using PySpark.
- Commands to set proper aliases for the various PySpark DataFrames before proceeding with creating the fact table (optional).
- Commands for the various stages where the original data frame is appropriately joined with the dimension tables created above.
- Commands to clean and transform the data:
- Making sure that the appropriate primary key is present for the fact table.
- Rearranging the fields if necessary.
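A sketch of the fact-table construction, continuing from `df` and `dim_date` above. Only the date-dimension join is shown; the ATM, location and card-type joins follow the same pattern. The `trans_id` key and the selected columns are assumptions based on the illustrative schema used so far.

```python
from pyspark.sql import functions as F

# Alias the DataFrames so the join conditions stay unambiguous.
fact_atm_trans = (
    df.alias("src")
    .join(
        dim_date.alias("dd"),
        on=[
            F.col("src.year") == F.col("dd.year"),
            F.col("src.month") == F.col("dd.month"),
            F.col("src.day") == F.col("dd.day"),
            F.col("src.hour") == F.col("dd.hour"),
        ],
        how="left",
    )
    # ...repeat the join for the ATM, location and card-type dimensions...
    .withColumn("trans_id", F.monotonically_increasing_id())  # assumed fact primary key
    .select("trans_id", "dd.date_id", "src.atm_status", "src.transaction_amount")
)
```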
- Loading the dimension and fact tables into Amazon S3 bucket.
- Write the DataFrames containing the dimension and fact tables directly to folders in an S3 bucket.
[You should create different folders in your S3 bucket for the different dimensions and the fact table.]
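A minimal sketch of the load to S3; the bucket name and folder layout are placeholders, and the cluster is assumed to have IAM permissions to write to the bucket.

```python
# One folder per dimension and one for the fact table, as noted above.
s3_base = "s3a://<your-bucket>/atm-etl"

dim_date.write.mode("overwrite").csv(s3_base + "/dim_date")
fact_atm_trans.write.mode("overwrite").csv(s3_base + "/fact_atm_trans")
```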
Redshift
- Creation of a Redshift Cluster.
- You need to create a Redshift cluster in the same way as it was done in the Amazon Redshift module.
- Setting up a database in the Redshift cluster and running queries to create the dimension and fact tables.
- Queries to create the various dimension and fact tables with appropriate primary and foreign keys.
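If you prefer to run the DDL from code rather than the Redshift query editor, here is a sketch using psycopg2. The connection details are placeholders, and the column lists are the same illustrative subset used in the PySpark sketches; the real DDL must follow the target schema.

```python
import psycopg2

# Placeholder connection details -- use your cluster endpoint, database and credentials.
conn = psycopg2.connect(
    host="<redshift-cluster-endpoint>",
    port=5439,
    dbname="dev",
    user="<user>",
    password="<password>",
)
cur = conn.cursor()

# One dimension table with a primary key...
cur.execute("""
    CREATE TABLE IF NOT EXISTS dim_date (
        date_id  VARCHAR(20) PRIMARY KEY,
        year     INT,
        month    VARCHAR(20),
        day      INT,
        weekday  VARCHAR(20),
        hour     INT
    );
""")

# ...and the fact table with a foreign key to it.
cur.execute("""
    CREATE TABLE IF NOT EXISTS fact_atm_trans (
        trans_id            BIGINT PRIMARY KEY,
        date_id             VARCHAR(20) REFERENCES dim_date (date_id),
        atm_status          VARCHAR(20),
        transaction_amount  INT
    );
""")
conn.commit()
```

Keep in mind that Redshift treats primary and foreign keys as informational constraints; they are used by the planner but not enforced.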
- Loading data into a Redshift cluster from Amazon S3 bucket.
- Queries to copy the data from S3 buckets to the Redshift cluster in the appropriate tables.
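A sketch of the COPY step, reusing the cursor from the previous sketch. The S3 path and the IAM role ARN are placeholders; the role must allow Redshift to read from the bucket.

```python
# Copy one dimension from S3 into its Redshift table; repeat for the other tables.
cur.execute("""
    COPY dim_date
    FROM 's3://<your-bucket>/atm-etl/dim_date/'
    IAM_ROLE '<redshift-s3-read-role-arn>'
    FORMAT AS CSV;
""")
conn.commit()
```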
- Using queries on the Redshift cluster to find the solutions to the following analytical queries (an example query is sketched after this list).
- Top 10 ATMs where most transactions are in the ‘inactive’ state.
- Number of ATM failures corresponding to the different weather conditions recorded at the time of the transactions.
- Top 10 ATMs with the most number of transactions throughout the year.
- Number of overall ATM transactions going inactive per month for each month.
- Top 10 ATMs with the highest total amount withdrawn throughout the year.
- Number of failed ATM transactions across various card types.
- Top 10 records with the number of transactions ordered by the ATM_number, ATM_manufacturer, location, weekend_flag and then total_transaction_count, on weekdays and on weekends throughout the year.
- Most active day for each ATM from the location “Vejgaard”.
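As an example of how these analyses can be expressed, here is a sketch of the first query, run through the same cursor as above. It assumes a dim_atm dimension with atm_number and atm_manufacturer columns and an atm_id foreign key on the fact table, analogous to the dim_date sketch; the exact status value ('Inactive' vs 'inactive') depends on the source data.

```python
# Top 10 ATMs with the most transactions in the 'Inactive' state.
cur.execute("""
    SELECT a.atm_number,
           a.atm_manufacturer,
           COUNT(*) AS total_transaction_count
    FROM fact_atm_trans f
    JOIN dim_atm a ON f.atm_id = a.atm_id
    WHERE f.atm_status = 'Inactive'
    GROUP BY a.atm_number, a.atm_manufacturer
    ORDER BY total_transaction_count DESC
    LIMIT 10;
""")
for row in cur.fetchall():
    print(row)
```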
***
Important Note: Amazon Redshift is a costly AWS service. Hence, to avoid burning up your monthly AWS budget, please make sure to pause your Redshift cluster during the project and terminate it once you are done with the project.
Please follow the steps below to delete the Redshift cluster. This must be done to avoid burning up your AWS budget.
The steps to create a Redshift cluster are straightforward, so we highly recommend that you create a new cluster every time you want to go through the Redshift analysis queries (if possible) and delete it once you are done. Please plan your practice accordingly so that you are able to sit for a long stretch.
***
For the analytical tasks at the end, you need to provide the appropriate queries along with a screenshot of the result (limited to the first page of the query result page if the result is larger than that). All the queries are designed in such a way that no more than 15 records should be returned.
Refer to the sample solution format given in the ‘Submission’ segment of the next ‘Submission Guidelines’ session to get more clarity on the things to be submitted.
Finally, in the next video, our expert concludes the introduction to the project and shares a few tips before you start developing the ETL project.
The following document is a Validation document which you can use to verify the correctness of your solution at the various steps of the project.
This is a Hint document which will help you while you are working on the project.