I was just catching up on my reading and found an excellent post on Kirk McGowan’s blog discussing Oracle Clusterware’s fencing mechanisms. As Kirk details, there are many theories regarding the effectiveness and safety of Oracle’s fencing approach, and he provides his usual no-nonsense responses to those theories.
In case you are lost, a little background may be helpful. Fencing (generally speaking) is a mechanism employed by clusterware to force one or more nodes out of a cluster in the event of a problem. The problems can be, and usually are, serious ones, and without a fencing algorithm most clusters would likely be very unstable. There are many different approaches to fencing. Some vendors provide I/O fencing, which works with the storage to stop any I/O from the node being evicted from the cluster and therefore prevents corruption of the cluster filesystem and/or database files residing in non-filesystem storage (like ASM or raw devices). Oracle performs fencing at the node level, using a modified form of the algorithm known as STONITH (Shoot The Other Node In The Head). As Kirk explains, since there are no easily accessible APIs for remotely powering off other cluster nodes, Oracle Clusterware instead uses node suicide: rather than kicking the other node out of the cluster, the problem node removes itself by rebooting. Presumably, if some persistent failure remains when the node restarts, the node won’t be able to rejoin the cluster and administrator intervention will be required to resolve the problem.
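To make the node-suicide idea a bit more concrete, here is a minimal Python sketch of a node that fences itself once it has gone too long without confirming its cluster membership. This is a toy model only, not Oracle Clusterware's actual algorithm: the class name `SelfFencingNode`, the 30-second timeout, and the `reboot` callback are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SelfFencingNode:
    """Toy model of a node that removes itself instead of shooting a peer."""
    name: str
    timeout_s: float                  # hypothetical heartbeat timeout
    reboot: Callable[[str], None]     # stand-in for a real reboot action
    last_heartbeat_s: float = 0.0
    fenced: bool = False

    def record_heartbeat(self, now_s: float) -> None:
        # Called whenever the node successfully confirms cluster membership.
        self.last_heartbeat_s = now_s

    def check(self, now_s: float) -> bool:
        # If the node has been out of contact for too long, it fences itself
        # by triggering its own reboot rather than powering off another node.
        if not self.fenced and now_s - self.last_heartbeat_s > self.timeout_s:
            self.fenced = True
            self.reboot(self.name)
        return self.fenced


def simulate() -> List[str]:
    """Run a short timeline and return the names of nodes that rebooted."""
    rebooted: List[str] = []
    node = SelfFencingNode("node1", timeout_s=30.0, reboot=rebooted.append)
    node.record_heartbeat(now_s=0.0)
    node.check(now_s=20.0)   # within the timeout: stays in the cluster
    node.check(now_s=45.0)   # 45 s without a heartbeat: self-fences
    node.check(now_s=60.0)   # already fenced: no second reboot
    return rebooted


if __name__ == "__main__":
    print(simulate())
```

The point of the sketch is that the decision is made locally: the node needs no remote power-off API, only its own view of whether it is still a healthy cluster member.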
Anyway, Kirk’s treatment of the topic is great and I learned a lot (as I often do when listening to Kirk). Thanks for a great article (and your usual wit), Kirk!