So far, you have learnt that Redshift uses a cluster of nodes to distribute data and parallelise computation. In general, every component added to a cluster of machines increases the odds of a failure. Each failure has the potential to bring the entire system to a halt or, even worse, to lose data. This makes fault tolerance an essential aspect of Redshift. In the next video, let’s hear from our SME as he explains how fault tolerance is implemented in Redshift.
When you load data, Redshift synchronously replicates it to other disks in the cluster. The data is then automatically replicated to S3 to provide continuous, incremental backups. Note that synchronous replication is not supported in a single-node cluster, as there is nowhere to replicate the data to; it is therefore recommended to run at least two nodes in production.
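To make this concrete, here is a minimal sketch using boto3 (the AWS SDK for Python) that launches a two-node cluster, so that in-cluster replication is possible. The cluster identifier, region, node type and credentials shown are hypothetical placeholders, not values prescribed by the course.

```python
import boto3

# Hypothetical region; replace with the one your account uses.
redshift = boto3.client('redshift', region_name='ap-south-1')

# Launch a two-node cluster so that synchronous in-cluster replication
# is available (a single-node cluster has nowhere to replicate data to).
redshift.create_cluster(
    ClusterIdentifier='prod-cluster',                  # hypothetical name
    ClusterType='multi-node',
    NodeType='dc2.large',                              # example node type
    NumberOfNodes=2,
    MasterUsername='admin',
    MasterUserPassword='ReplaceWithAStrongPassword1',  # placeholder only
    DBName='analytics',
)
```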
A Redshift cluster is actively monitored for disk and node failures. If a disk fails on a node, the node automatically switches to an in-cluster replica of the failed drive while Redshift prepares a new, healthy drive. When a node dies, Redshift stops serving queries until the cluster is healthy again: it automatically provisions and configures a replacement node and resumes operations only once the data is consistent again.
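You can observe this recovery behaviour from outside the cluster. Below is a small sketch, again using boto3 with a hypothetical cluster identifier, that checks the cluster status and lists recent events; during a node replacement, the status moves away from 'available' until consistency is restored.

```python
import boto3

redshift = boto3.client('redshift')

# Inspect the overall cluster state and current node count.
cluster = redshift.describe_clusters(
    ClusterIdentifier='prod-cluster'   # hypothetical name
)['Clusters'][0]
print(cluster['ClusterStatus'], cluster['NumberOfNodes'])

# List events from the last 24 hours, e.g. node-replacement notifications.
events = redshift.describe_events(
    SourceIdentifier='prod-cluster',
    SourceType='cluster',
    Duration=24 * 60,                  # look-back window in minutes
)
for event in events['Events']:
    print(event['Date'], event['Message'])
```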
In theory, you can expect an unhealthy cluster to recover on its own, without any intervention. Depending on the severity of the failure, some downtime is to be expected. In other words, Redshift favours consistency over availability.