[CI] problem when I removed the network cable from one node

Bruce Walker bruce@kahuna.cag.cpqcorp.net
Mon, 13 Aug 2001 11:34:16 -0700 (PDT)


Dave,
  Perhaps I think of STOMITH in an expanded way, but I see it as an
SBA tool.  Considering just the 2-node case, split-brain can occur
for the following reasons:
  a. on startup the two nodes don't see each other and assume the 
    other is dead and set themselves up;
  b. one node is up as a cluster;  the other boots but does not
    see the first node and sets itself up as the cluster;
  c. both nodes are up and talking and communication ceases so
    each thinks the other is down and "runs" the cluster.

Having a second or third communication path between the nodes
decreases the chances of split brain.  One potential
communication path (direct or indirect) is often the disk, which
could also be thought of as a third member of the cluster (for
voting purposes).
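
To make the "disk as a third voter" idea concrete, here is a minimal
sketch.  It assumes three votes in total (two nodes plus the disk) and
that at most one node can hold the disk's vote at a time, for example
through some kind of reservation; the function names are hypothetical
and not taken from any existing cluster manager.

```python
def node_votes(sees_peer, holds_disk):
    """Count the votes a node can claim: its own, its peer's, the disk's.

    Assumption: only one node at a time can hold the disk vote.
    """
    return 1 + int(sees_peer) + int(holds_disk)


def has_quorum(votes_seen, total_votes=3):
    """True when votes_seen is a strict majority of total_votes."""
    return votes_seen > total_votes // 2
```

With this counting, a node that has lost its peer but holds the disk
still has 2 of 3 votes and may continue; a node that has lost both
has only its own vote and must not run the cluster.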

I think we could/should have a longer discussion on how to use the
disk subsystem or disk pathways for cluster membership but for now
my point is that for many configurations you can avoid split-brain
by either shooting the other node or preventing him from accessing
a critical resource (e.g. disk).  An appropriate application of
STOMITH can do this.  Used as a membership tool, how do you like
the following algorithm for case "c" above:
   a. having noticed we can't talk to the other node, try an
	alternate communication path.
   b. if the alternate fails or there is none, try a serial cable
	just to see if the other guy is alive;  if he is, decide which
	of you is going to continue the cluster and which is going to
	reboot/halt.
   c. still can't talk to him? shoot him or take away his disks.  
	Obviously if both nodes do this at exactly the same time they
	might both die so staggering helps;  my assumption here is
	that taking away someone's disks will cause them to notice and
	die.
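
The three steps above can be sketched roughly as follows.  This is
only an illustration under stated assumptions: the path checks, the
serial-link check and the fencing action are passed in as hypothetical
callables, and the stagger is a simple delay proportional to a node
id, which is one possible way to keep both nodes from shooting at
exactly the same moment.

```python
import time
from enum import Enum


class Outcome(Enum):
    PEER_REACHABLE = "peer answered on an alternate path"
    NEGOTIATE = "peer alive on serial link; decide who continues"
    FENCED = "peer shot or cut off from its disks"


def resolve_lost_peer(alternate_paths, serial_alive, fence_peer,
                      node_id, stagger_seconds=1.0, sleep=time.sleep):
    """Escalate after losing contact with the other node.

    alternate_paths: iterable of callables returning True if the peer
                     answers on that path (hypothetical probes).
    serial_alive:    callable returning True if the peer answers on a
                     serial cable, or None if there is no such cable.
    fence_peer:      callable that shoots the peer or removes its disks.
    """
    # Step a: try every alternate communication path.
    for path_ok in alternate_paths:
        if path_ok():
            return Outcome.PEER_REACHABLE

    # Step b: the serial cable is only used to learn whether the peer
    # is alive; who continues is left to the caller.
    if serial_alive is not None and serial_alive():
        return Outcome.NEGOTIATE

    # Step c: stagger by node id (an assumption), then fence the peer.
    sleep(node_id * stagger_seconds)
    fence_peer()
    return Outcome.FENCED
```

The sketch relies on the assumption stated in step c: a node whose
disks are taken away during its stagger delay notices and dies before
it gets to fence anyone.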

> 
> Because my exposure to SBA systems and Quorum systems is limited, I'm really
> not an expert on this, so please fill me in if I'm missing something here.
> 
> I understand SBA and Quorum algorithms as doing similar things.  They are
> employed by cluster managers to determine the operating state of the cluster.
> Based on this, the cluster manager will support further system functions it
> has control of, or disable them.
> 
> I/O fencing (STOMITH) is something different.  Cluster managers work fine
> without STOMITH.  When a node fails the cluster manager invokes various
> recovery steps.  One recovery step may involve a resource shared directly by
> nodes (like a FC or SCSI disk).  GFS recovery falls in this category. The
> recovery procedure for a resource like this needs to begin with STOMITH.
> 
> The answer to why STOMITH needs to be the first step in shared resource
> recovery is another topic which is simple but often not completely understood.

My assumption is that you can't release any locks the down node had on the
resource until you are sure the node is really down (no more i/o's going
to happen);  then you have to "recover" the resource (replay the log) before
you release the locks, so other nodes can't do conflicting operations on the
filesystem while the log is being replayed.  Am I correct?
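
To spell out the ordering I am assuming, here is a minimal sketch.
The three callables are hypothetical placeholders for whatever the
cluster manager and filesystem actually provide; this is not how GFS
is known to implement recovery, only the sequence described above.

```python
def recover_shared_resource(fence, replay_log, release_locks):
    """Run recovery for a failed node in the assumed safe order.

    fence:         callable returning True once the failed node can no
                   longer issue I/O to the shared resource.
    replay_log:    callable that replays the failed node's log.
    release_locks: callable that frees the failed node's locks.
    """
    # 1. Make sure no more I/O can arrive from the failed node.
    if not fence():
        raise RuntimeError("node not confirmed down; keeping its locks")
    # 2. Replay the log while the locks still keep other nodes out.
    replay_log()
    # 3. Only now let other nodes touch what the failed node had locked.
    release_locks()
```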

> 
> In an arrangement where a software layer can be programmed on the side of the
> shared resource, STOMITH simplifies to a situation where the software in front
> of the shared resource blocks access from a STOMITH victim.  This is so simple
> that it's often not even explicitly pointed out in systems which do it.

Can't we consider this software to be participating in the membership
decision, or at least in membership enforcement?  With the assumption that
a node will die, or at least not act in a split-brain fashion, when it is
i/o fenced away from the shared resource, I see this STOMITH variant as
an important SBA tool.
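
As a rough illustration of software in front of the shared resource
enforcing membership, here is a small sketch.  The class and its
methods are hypothetical, and the in-memory dictionary merely stands
in for a real disk.

```python
class FencingLayer:
    """Toy front end to a shared resource that rejects fenced nodes."""

    def __init__(self):
        self._fenced = set()   # nodes no longer allowed to do I/O
        self._blocks = {}      # stand-in for the shared disk

    def fence(self, node):
        """Block all further access from node (the STOMITH victim)."""
        self._fenced.add(node)

    def _check(self, node):
        if node in self._fenced:
            raise PermissionError(f"{node} is fenced from the resource")

    def write(self, node, block, data):
        self._check(node)
        self._blocks[block] = data

    def read(self, node, block):
        self._check(node)
        return self._blocks[block]
```

A fenced node sees every request fail, which is the point at which I
assume it either dies or at least stops acting like the cluster.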
> 
> So, a cluster where STOMITH happens to be a part of a recovery step still
> requires some sort of SBA or Quorum in the cluster manager.

I think there is synergy in allowing STOMITH to be part of SBA.  Do you
still disagree?

bruce
> 
> -- 
> Dave Teigland  <teigland@sistina.com>