
Submission Guidelines

Submissions Required

Upload a zip file containing:

Document-01: A PDF document (SqoopDataIngestion.pdf) containing the Sqoop code used for ingesting data from the RDS server. It should include the ingestion code along with screenshots from the EC2 instance showing the list of files generated in the HDFS cluster, together with proper explanation and comments.
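For reference, a Sqoop ingestion of this kind typically follows the pattern sketched below. This is a minimal illustration, not the project's actual command: the RDS endpoint, database, table, credentials, and HDFS paths are all hypothetical placeholders to be replaced with your own values.

```shell
# Minimal sketch of a Sqoop import from an RDS (MySQL) server into HDFS.
# Every value in angle brackets is a placeholder, not a real project value.
sqoop import \
  --connect jdbc:mysql://<rds-endpoint>:3306/<database> \
  --username <user> \
  --password-file /user/ec2-user/.db.password \
  --table <source_table> \
  --target-dir /user/ec2-user/<source_table> \
  --num-mappers 1

# List the files generated in HDFS (useful for the required screenshot):
hadoop fs -ls /user/ec2-user/<source_table>
```

The `hadoop fs -ls` output on the EC2 instance is what the screenshot requirement above refers to.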

The sample template for this document is as follows:

Document-02: A Jupyter Notebook (SparkETLCode.ipynb) containing the PySpark code to read the data into Spark, create the Dimension and Fact tables, and load them into the S3 bucket. The notebook should be properly commented, explain all the steps taken, and include appropriate Markdown cells for explanation.
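The ETL flow described above can be sketched as follows. This is only an illustrative outline that assumes the data was ingested by Sqoop into HDFS; the HDFS path, column names (`_c0`, `_c1`), bucket name, and table definitions are hypothetical and must be adapted to the actual dataset.

```python
# Minimal sketch of the read -> transform -> load-to-S3 flow.
# All paths, columns, and the bucket name are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SparkETL").getOrCreate()

# 1. Read the Sqoop-ingested data from HDFS into Spark.
raw = spark.read.csv("/user/ec2-user/<source_table>",
                     header=False, inferSchema=True)

# 2. Create a Dimension table (distinct entities) and a Fact table (measures).
dim_customer = raw.select("_c0", "_c1").distinct()
fact_orders = raw.groupBy("_c0").agg(F.count("*").alias("order_count"))

# 3. Load both tables to the S3 bucket.
dim_customer.write.mode("overwrite").parquet("s3a://<your-bucket>/dim_customer/")
fact_orders.write.mode("overwrite").parquet("s3a://<your-bucket>/fact_orders/")
```

Writing as Parquet is one common choice here; CSV output works equally well as long as the Redshift load queries in Document-03 match the chosen format.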

Document-03: A PDF document (RedshiftSetup.pdf) containing the following:

Screenshots of the configuration of the Redshift cluster that you create for the project

Queries used for creating the Dimension and Fact tables on the Redshift cluster, along with screenshots showing the successful status of each query

Queries used for loading the data into the Dimension and Fact tables in the Redshift cluster from the S3 bucket, along with screenshots showing the successful status of each query
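The create and load queries listed above typically follow the pattern below. This is a minimal sketch with hypothetical table and column names; the bucket path, IAM role ARN, and file format are placeholders that must match your own S3 output from the ETL step.

```sql
-- Minimal sketch: create one Dimension table and load it from S3.
-- Table name, columns, bucket, and IAM role are all placeholders.
CREATE TABLE dim_customer (
    customer_id   INT,
    customer_name VARCHAR(100)
);

COPY dim_customer
FROM 's3://<your-bucket>/dim_customer/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-role>'
FORMAT AS PARQUET;
```

The screenshot requirement refers to the "completed successfully" status shown in the Redshift query editor after each statement runs.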

The sample template for this document is as follows:

Document-04: A PDF document (RedshiftQueries.pdf) containing the queries used for solving the analytical questions on the Redshift cluster. It should also include screenshots of the first page of the tables produced after running the queries on Redshift.
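An analytical query of this kind usually joins the Fact and Dimension tables built earlier. The sketch below uses entirely hypothetical table and column names (`fact_orders`, `dim_customer`, `order_count`) purely to illustrate the shape of such a query; replace them with the project's actual schema.

```sql
-- Hypothetical example: top 10 customers by order count.
-- All table and column names are placeholders for illustration.
SELECT d.customer_name,
       f.order_count
FROM fact_orders AS f
JOIN dim_customer AS d
  ON d.customer_id = f.customer_id
ORDER BY f.order_count DESC
LIMIT 10;
```

The first page of the result grid returned by each such query is what the screenshot requirement above refers to.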

The sample template for this document is as follows:

Please make sure that you do not change any of the file names provided above in parentheses. The code you submit should run at our end without any modifications.

Make sure that you have not made any changes to the original data set. You will be graded on the queries and documentation submitted.
