[CI] problem when I removed the network cable from one node

Peter Badovinatz tabmowzo@yahoo.com
Thu, 9 Aug 2001 23:41:04 -0700 (PDT)

--- Bruce Walker <bruce@kahuna.cag.cpqcorp.net> wrote:
> All,
>   Everything Kai-Min says below is correct.  The Split-Brain
> avoidance code (SBA) utilized a serial line between the nodes;
> before doing a takeover, the line was queried. In addition, there
> are two other approaches to the problem:
>    a: STOMITH - (Shoot The Other Man In The Head);  Sistina
>       has code for this used in GFS that needs to be integrated.
>    b: multiple interconnects;  in NSC we supported having more
>       than one ethernet between nodes and to failover if one
>       path failed;  haven't ported that part yet either.
> bruce

Although multiple interconnects reduce the likelihood of a split-brain
cluster, they don't prevent it.  Was/is a kernel panic the normal reaction to
a cluster "merge", i.e. one that occurred once it was determined that the
cluster had been 'sundered', as opposed to nodes having died?

BTW, we adopted the use of 'sundered' and 'merged' as easy verbs to describe
the split-brain situation when we were discussing it while doing some of the
IBM clustering support :o)

One of our primary reactions to a merged cluster was always to shut down one
side or the other.  Those nodes could then start fresh and try to rejoin the
cluster - or they would stay shut down until the admin had made sure all data
was well and rejoined them.  The biggest worry was whether the admin had forced
config updates on one side or the other (or both!) during the split-brain
period.  We could detect that this had been done, but we would not force-sync.

Not as elegant as automatically merging everything, but it worked as an
extraordinary reaction to an extraordinary situation.
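
A minimal sketch of that reaction, in Python, might look like the following.
Every name in it (Partition, resolve_merge, config_changed) is hypothetical,
and the rule for picking which side survives (larger side wins, ties go to the
side holding the lowest node name) is purely an assumption for illustration;
the text above only says that one side or the other was shut down.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Partition:
    """One side of a sundered cluster (hypothetical structure)."""
    nodes: FrozenSet[str]      # assumed non-empty
    config_changed: bool       # admin forced a config update during the split


def resolve_merge(a: Partition, b: Partition) -> Tuple[Partition, Partition, str]:
    """Return (survivor, loser, action to take on the loser's nodes)."""
    # Assumed tie-break: larger side survives; on equal size, the side
    # containing the lowest node name survives.
    survivor, loser = sorted((a, b), key=lambda p: (-len(p.nodes), min(p.nodes)))

    # Never force-sync: if either side saw forced config updates, the
    # shut-down nodes stay down until the admin has checked the data.
    if a.config_changed or b.config_changed:
        action = "stay_down_for_admin"
    else:
        action = "restart_and_rejoin"
    return survivor, loser, action
```

The point of the design is that the code only detects divergence and picks a
side to stop; reconciling configuration is left to a human.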

One final note: at the heartbeating level, split-brain wasn't an issue for us,
as there were no resources, and the 'discovery' process for nodes was
essentially a series of mergers of sub-clusters until all defined nodes knew
about each other.  It was the resource management and membership code that
cared about split-brain.  We also depended on the ability to do hardware
reserves or fencing for shared resources.
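
To illustrate why discovery can treat every merge as harmless when no
resources are involved, here is a small Python sketch that models discovery as
repeated merging of sub-clusters.  It uses a union-find structure, which is a
stand-in chosen for brevity and not a description of how the actual
heartbeating code worked; discover() and its arguments are hypothetical names.

```python
def discover(defined_nodes, heartbeats):
    """Merge sub-clusters as heartbeats arrive; return the resulting groups.

    defined_nodes: iterable of node names in the cluster definition.
    heartbeats:    iterable of (a, b) pairs meaning "a and b heard each other".
    """
    nodes = list(defined_nodes)
    parent = {n: n for n in nodes}   # every node starts as its own sub-cluster

    def find(n):
        # Walk up to the sub-cluster representative, halving the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in heartbeats:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra          # merge the two sub-clusters

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())
```

Discovery is complete when a single group containing all defined nodes
remains; a network split at this level simply leaves more than one group
until the heartbeats resume.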


These have been the opinions of:
Peter R. Badovinatz -- (503)578-5530 (TL 775)
and in no way should be construed as official opinion of 
IBM, Corp.
