Securing your system when the roof is on fire with High Availability Clusters
Fault-tolerant systems are defined by their ability to continue operating when a component fails. Essentially, a fault-tolerant system must keep processing data no matter the situation (even if the system is on fire). So how do we ensure that data processing continues?
System developers must duplicate the hardware for every critical component of a system and teach the software to re-route the data flow to the alternative hardware once a failure is detected. This comes with several challenges, including ensuring that the software reacts only when needed and that it successfully transfers operations to the duplicate hardware.
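To make the "react only when needed" challenge concrete, here is a minimal sketch of one common approach: count consecutive missed heartbeats and switch to the duplicate hardware only after a threshold is reached. The class name `FailoverMonitor`, the `miss_threshold` parameter, and the labels "primary" and "standby" are assumptions made for illustration; they do not come from any particular cluster manager.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Decides when to switch from primary to standby hardware (illustrative only)."""
    miss_threshold: int = 3   # hypothetical: consecutive missed heartbeats before failover
    active: str = "primary"   # which unit currently carries the data flow
    missed: int = 0           # consecutive heartbeats missed so far

    def observe(self, heartbeat_ok: bool) -> str:
        """Record one heartbeat result and return the unit that should be active."""
        if self.active != "primary":
            # Already failed over: stay on the standby rather than flapping back.
            return self.active
        if heartbeat_ok:
            # A single good heartbeat clears earlier transient misses.
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.miss_threshold:
                # Sustained failure: re-route to the duplicate hardware.
                self.active = "standby"
        return self.active


# Usage: one transient miss is ignored; three misses in a row trigger failover.
monitor = FailoverMonitor(miss_threshold=3)
results = [True, False, True, False, False, False, True]
history = [monitor.observe(ok) for ok in results]
print(history)
```

The threshold trades detection speed against false alarms: a lower value fails over faster but is more likely to react to a brief glitch. A real system would also need to confirm that the standby actually took over, which this sketch does not attempt.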
In a High Performance Embedded Computing (HPEC) cluster, there are compute nodes and a cluster manager, also known as the head node. The head node is the connection between the HPEC cluster and the external network. It controls all other devices and eases the administration of the compute nodes. Because the cluster manager provisions the compute nodes, replacing one after a hardware failure is simpler. This reduces the risk of errors and allows a confident node replacement even when the rest of the system may be failing.
While the head node gives us a secure and reliable way to recover from a compute node hardware failure, the downside is that the head node itself is a single point of failure for the entire system.
What is the solution? A high-availability setup derived from the HPC world. Download the white paper "HPEC: High Availability by Design" to learn more about:
- High Availability clusters
- Fault Tolerance Software
- HPC applications for HPEC
- Cluster Managers
- The STONITH process