In the next video, our expert will walk you through the entire Solution Approach for this project.
As discussed in the video, the tasks to be performed for this project are as follows:
- Data ingestion from RDS to HDFS using Sqoop.
- Sqoop import command.
- Command to see the list of imported data in HDFS.
For this step, the RDS connection string and credentials are as follows (a sample import command is sketched after this list):
- RDS Connection String –
- Username – student
- Password – STUDENT123
- Table Name – SRC_ATM_TRANS
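A minimal sketch of the Sqoop import and verification commands described above, wrapped in Python subprocess calls so the examples stay in one language. The JDBC connection string and the target HDFS directory are placeholders and assumptions; the username, password and table name are the ones listed above.

```python
import subprocess

# Placeholder JDBC URL -- substitute the RDS connection string provided for the project.
jdbc_url = "jdbc:mysql://<rds-endpoint>:3306/<database>"

# Sqoop import: pull the SRC_ATM_TRANS table from RDS into HDFS.
subprocess.run([
    "sqoop", "import",
    "--connect", jdbc_url,
    "--username", "student",
    "--password", "STUDENT123",
    "--table", "SRC_ATM_TRANS",
    "--target-dir", "/user/ec2-user/atm_trans",  # assumed HDFS directory
    "--fields-terminated-by", ",",
    "--num-mappers", "1",
], check=True)

# Verify the import by listing the files written to HDFS.
subprocess.run(["hadoop", "fs", "-ls", "/user/ec2-user/atm_trans"], check=True)
```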
PySpark
- Reading the data from the files in HDFS with a specific schema using PySpark.
- Command to create an input schema using StructType (we recommend creating a custom schema with the StructType class of PySpark to avoid any data type mismatch).
- Commands to read the data using the input schema created and to verify the data using the count function.
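A minimal sketch of the schema definition and read step, assuming the Sqoop output is comma-delimited text under the HDFS directory used above. The column names and types shown are an illustrative subset; the full schema must cover every field of SRC_ATM_TRANS.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("atm_etl").getOrCreate()

# Illustrative subset of the source columns -- extend it to every field in
# SRC_ATM_TRANS so there is no data type mismatch on read.
input_schema = StructType([
    StructField("year", IntegerType(), True),
    StructField("month", StringType(), True),
    StructField("day", IntegerType(), True),
    StructField("weekday", StringType(), True),
    StructField("hour", IntegerType(), True),
    StructField("atm_status", StringType(), True),
    StructField("atm_id", StringType(), True),
    StructField("atm_manufacturer", StringType(), True),
    StructField("atm_location", StringType(), True),
    StructField("transaction_amount", IntegerType(), True),
])

# Read the imported files using the custom schema and verify the record count.
df = spark.read.csv("/user/ec2-user/atm_trans", schema=input_schema, header=False)
print(df.count())
```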
- Creation of dimension tables using PySpark.
- Command to create a data frame for the dimension according to the target schema (dimension model) provided.
- Commands to clean and transform the data (an illustrative PySpark sketch follows the note below):
- Making sure that duplicate records are removed (avoid duplicate information, especially in the dimension tables).
- Making sure that appropriate primary keys are present for the dimensions (you need to generate a primary key for each dimension table; for example, for the ‘Date‘ dimension, one way to generate the primary key is to add “date” as a prefix to the row number, i.e. ‘date1’, ‘date2’ and so on).
- Rearranging the fields if necessary (according to the target schema).
Note
Here, the tasks given above have to be done for all four dimension tables.
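As an illustration of the cleaning and primary-key steps above, here is a sketch for one dimension (the date dimension), built from the DataFrame `df` read in the previous sketch. The column list and the `date_id` key name are assumptions; follow the target dimension model provided with the project.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Select the dimension attributes and drop duplicate rows.
dim_date = df.select("year", "month", "day", "weekday", "hour").dropDuplicates()

# Generate a surrogate primary key such as 'date1', 'date2', ... via a row number.
w = Window.orderBy("year", "month", "day", "hour")
dim_date = (
    dim_date
    .withColumn("date_id", F.concat(F.lit("date"), F.row_number().over(w).cast("string")))
    # Rearrange the fields so the key comes first, matching the target schema.
    .select("date_id", "year", "month", "day", "weekday", "hour")
)
```

The other three dimensions follow the same pattern with their own attribute columns and key prefixes.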
- Creation of transaction fact table using PySpark.
- Commands to set proper aliases for the various PySpark DataFrames before proceeding with creating the fact table (optional).
- Commands for the various stages where the original data frame is appropriately joined with the dimension tables created above.
- Commands to clean and transform the data:
- Making sure that the appropriate primary key is present for the fact table.
- Rearranging the fields if necessary.
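A sketch of the fact-table construction, continuing from `df` and `dim_date` above. Only the date-dimension join is shown; the ATM, location and card-type joins follow the same pattern. The `trans_id` key and the selected columns are assumptions based on the illustrative schema used so far.

```python
from pyspark.sql import functions as F

# Alias the DataFrames so the join conditions stay unambiguous.
fact_atm_trans = (
    df.alias("src")
    .join(
        dim_date.alias("dd"),
        on=[
            F.col("src.year") == F.col("dd.year"),
            F.col("src.month") == F.col("dd.month"),
            F.col("src.day") == F.col("dd.day"),
            F.col("src.hour") == F.col("dd.hour"),
        ],
        how="left",
    )
    # ...repeat the join for the ATM, location and card-type dimensions...
    .withColumn("trans_id", F.monotonically_increasing_id())  # assumed fact primary key
    .select("trans_id", "dd.date_id", "src.atm_status", "src.transaction_amount")
)
```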
- Loading the dimension and fact tables into Amazon S3 bucket.
- Write the DataFrames containing the dimension and fact tables directly to folders in an S3 bucket.
[You should create different folders in your S3 bucket for the different dimensions and the fact table.]
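A minimal sketch of the load to S3; the bucket name and folder layout are placeholders, and the cluster is assumed to have IAM permissions to write to the bucket.

```python
# One folder per dimension and one for the fact table, as noted above.
s3_base = "s3a://<your-bucket>/atm-etl"

dim_date.write.mode("overwrite").csv(s3_base + "/dim_date")
fact_atm_trans.write.mode("overwrite").csv(s3_base + "/fact_atm_trans")
```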
Redshift
- Creation of a Redshift Cluster.
- You need to create a Redshift cluster in the same way as it was done in the Amazon Redshift module.
- Setting up a database in the Redshift cluster and running queries to create the dimension and fact tables.
- Queries to create the various dimension and fact tables with appropriate primary and foreign keys.
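If you prefer to run the DDL from code rather than the Redshift query editor, here is a sketch using psycopg2. The connection details are placeholders, and the column lists are the same illustrative subset used in the PySpark sketches; the real DDL must follow the target schema.

```python
import psycopg2

# Placeholder connection details -- use your cluster endpoint, database and credentials.
conn = psycopg2.connect(
    host="<redshift-cluster-endpoint>",
    port=5439,
    dbname="dev",
    user="<user>",
    password="<password>",
)
cur = conn.cursor()

# One dimension table with a primary key...
cur.execute("""
    CREATE TABLE IF NOT EXISTS dim_date (
        date_id  VARCHAR(20) PRIMARY KEY,
        year     INT,
        month    VARCHAR(20),
        day      INT,
        weekday  VARCHAR(20),
        hour     INT
    );
""")

# ...and the fact table with a foreign key to it.
cur.execute("""
    CREATE TABLE IF NOT EXISTS fact_atm_trans (
        trans_id            BIGINT PRIMARY KEY,
        date_id             VARCHAR(20) REFERENCES dim_date (date_id),
        atm_status          VARCHAR(20),
        transaction_amount  INT
    );
""")
conn.commit()
```

Keep in mind that Redshift treats primary and foreign keys as informational constraints; they are used by the planner but not enforced.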
- Loading data into a Redshift cluster from Amazon S3 bucket.
- Queries to copy the data from S3 buckets to the Redshift cluster in the appropriate tables.
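A sketch of the COPY step, reusing the cursor from the previous sketch. The S3 path and the IAM role ARN are placeholders; the role must allow Redshift to read from the bucket.

```python
# Copy one dimension from S3 into its Redshift table; repeat for the other tables.
cur.execute("""
    COPY dim_date
    FROM 's3://<your-bucket>/atm-etl/dim_date/'
    IAM_ROLE '<redshift-s3-read-role-arn>'
    FORMAT AS CSV;
""")
conn.commit()
```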
- Using queries on the Redshift cluster to find the solutions to the following analytical queries (an example query is sketched after this list).
- Top 10 ATMs where most transactions are in the ‘inactive’ state.
- Number of ATM failures corresponding to the different weather conditions recorded at the time of the transactions.
- Top 10 ATMs with the most number of transactions throughout the year.
- Number of overall ATM transactions going inactive per month for each month.
- Top 10 ATMs with the highest total amount withdrawn throughout the year.
- Number of failed ATM transactions across various card types.
- Top 10 records with the number of transactions ordered by the ATM_number, ATM_manufacturer, location, weekend_flag and then total_transaction_count, on weekdays and on weekends throughout the year.
- Most active day for each ATM from the location “Vejgaard”.
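As an example of how these analyses can be expressed, here is a sketch of the first query, run through the same cursor as above. It assumes a dim_atm dimension with atm_number and atm_manufacturer columns and an atm_id foreign key on the fact table, analogous to the dim_date sketch; the exact status value ('Inactive' vs 'inactive') depends on the source data.

```python
# Top 10 ATMs with the most transactions in the 'Inactive' state.
cur.execute("""
    SELECT a.atm_number,
           a.atm_manufacturer,
           COUNT(*) AS total_transaction_count
    FROM fact_atm_trans f
    JOIN dim_atm a ON f.atm_id = a.atm_id
    WHERE f.atm_status = 'Inactive'
    GROUP BY a.atm_number, a.atm_manufacturer
    ORDER BY total_transaction_count DESC
    LIMIT 10;
""")
for row in cur.fetchall():
    print(row)
```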
***
Important Note: Amazon Redshift is a costly AWS service. Hence, to avoid burning up your monthly AWS budget, please make sure to pause your Redshift cluster during the project and terminate it once you are done with the project.
Please follow the steps below to delete the Redshift cluster. This must be done to avoid burning up your AWS budget.
The steps to create a Redshift cluster are straightforward, so we highly recommend that you create a new cluster every time you want to go through the Redshift analysis queries (if possible) and delete it once you are done. Please plan your practice accordingly so that you are able to sit for a long stretch.
***
For the analytical tasks at the end, you need to provide the appropriate queries along with a screenshot of the result (limited to the first page of the query result page if the result is larger than that). All the queries are designed in such a way that no more than 15 records should be returned.
Refer to the sample solution format given in the ‘Submission’ segment of the next ‘Submission Guidelines’ session to get more clarity on the things to be submitted.
Finally, in the next video, our expert concludes the introduction to the project and shares a few tips before you start developing the ETL project.
The following document is a Validation document which you can use to verify the correctness of your solution at the various steps of the project.
This is a Hint document which will help you while you are working on the project.