In the following video, Kamlesh will show you some steps to load data into a Redshift table.
Note:
File formats such as parque provide significant performance improvements.
With the help of this example, you understood the workings of the COPY command. Let’s now learn how learn how to use the COPY command.
As our expert explained in the video, the COPY command depends on the number of nodes assigned. The key responsibility of the COPY command is to utilise all nodes in the best manner possible.
Let’s now understand the workings of the COPY command through the example given below.
Suppose Mohan has 1,600 files to review and he has to complete the review of all the files in a few hours. This huge amount of workload is difficult for a single person to manage in the given time, and it would also impact the efficiency and quality of work.
However, if Mohan hires a team of 16 people and allocates the work equally across the team, then each person will have 100 files to review, assuming that one person can review 100 files in the given time. This way, he can ensure quality work within the given time frame.
Let’s now hear from the SME on how to use the COPY command to get data from S3 buckets in AWS Academy.
If you do not have an S3 bucket, then you can create it by following the steps mentioned in the document attached below.
You can download the data set from the URL below and save it in the local machine.
Amazon Tickit Dataset
Note:
There can be multiple reasons why a COPY command might show success without actually copying anything to Redshift such as associated IAM role having an incorrect policy, incorrect bucket name in IAM policy, or the file itself simply not being present in the S3 bucket in the first place.
Note:
For using Redshift cluster in AWS Academy, make sure that the IAM policy that you’re attaching to the cluster is “Lab Role”. The arn for the lab role used in the document below can be checked by going to the IAM service.
Now, let’s create a table and load data into it by following the steps mentioned in the document attached below.
The UNLOAD command is used to copy the results of query to a bucket. Let’s hear from the SME.
NOTE:
For the UNLOAD command to work, it’s essential that the bucket is already created. The command can automatically create the folders inside the bucket.
NOTE:
The myRedshiftRole has read-only access and hence for the UNLOAD command to work, you must modify the role with the AmazonS3FullAccess policy to it as mentioned in the video below.
Additional Reading
Data Unloading in Redshift
More about COPY command