[CI] problem when I removed the network cable from one node
Bruce Walker
bruce@kahuna.cag.cpqcorp.net
Thu, 9 Aug 2001 15:07:55 -0700 (PDT)
All,
Everything Kai-Min says below is correct. The Split-Brain
avoidance (SBA) code used a serial line between the nodes;
before doing a takeover, a node first queried that line. In
addition, there are two other approaches to the problem:
a: STOMITH (Shoot The Other Machine In The Head); Sistina
has code for this, used in GFS, that needs to be integrated.
b: multiple interconnects; in NSC we supported having more
than one Ethernet link between nodes and failing over if one
path failed; that part hasn't been ported yet either.
bruce
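
The serial-line check described above can be sketched roughly as follows. This is only an illustration of the idea (query a second channel before taking over), not the NSC code, which is not shown in this thread. The names `Action`, `decide_on_heartbeat_loss` and `query_serial_peer` are hypothetical, and treating a serial I/O error as "no answer" is an assumption.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    TAKE_OVER = "take_over"  # peer did not answer: assume it is really down
    HOLD = "hold"            # peer answered: only the network path failed


def decide_on_heartbeat_loss(query_serial_peer: Callable[[], bool]) -> Action:
    """Decide what to do after the network heartbeat to the peer is lost.

    query_serial_peer is a hypothetical callback that returns True if the
    peer answers over the serial line and False if it stays silent.
    """
    try:
        peer_alive = query_serial_peer()
    except OSError:
        # Assumption: an I/O error on the serial line counts as "no answer".
        peer_alive = False
    # Only take over when the second channel also says the peer is gone.
    return Action.HOLD if peer_alive else Action.TAKE_OVER
```

With a check like this, pulling only the network cable would leave the serial line answering, so neither node would promote itself. What a node should do while holding is not covered in this thread.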
> Aneesh,
> What you've done is simulate a Split-Brain scenario. Unfortunately,
> CI currently doesn't have any Split-Brain avoidance code. After you
> disconnected Node 2's cable, Node 2 thought Node 1 went down and
> therefore failed over to become the CLMS master. Node 1, on the other
> hand, thought Node 2 went down and processed a Nodedown event for it.
> At this point, you have two CLMS masters (also known as Split-Brain).
> When network connectivity is re-established, each node probes the other
> and realizes that its view of the CLMS master differs from the other's. Without
> split-brain avoidance code, the algorithm currently favors the lower
> numbered node as the CLMS master. Therefore, the panic you're seeing on
> Node 2 is the correct behavior. The original UnixWare Non-Stop Clusters
> code had support for split-brain avoidance; however, this code has not
> yet been ported. If anyone is interested in tackling this as a side
> project, I'd be happy to send you some of the code.
>
> Kai-Min Sung
> CI/SSI-Linux Developer
> kai-min.sung@compaq.com
>
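
The reconnect behavior Kai-Min describes (two nodes each claiming to be CLMS master, with the lower-numbered node favored) can be sketched as below. This is a minimal illustration for the two-node case in this thread, not the CI implementation. The function names and the "continue"/"panic" return strings are hypothetical, and the sketch assumes the calling node is one of the claimants.

```python
from typing import Iterable


def surviving_master(claimants: Iterable[int]) -> int:
    """Return the node number that stays CLMS master: the lowest claimant."""
    return min(claimants)


def action_for(node: int, claimants: Iterable[int]) -> str:
    """Return what `node` does once it has learned every node's master view.

    `claimants` holds the node number each probed node believes is master.
    """
    views = set(claimants)
    if len(views) <= 1:
        # Everyone agrees on the master, so there is nothing to resolve.
        return "continue"
    # Views differ: the lower-numbered claimant wins, the other node panics.
    return "continue" if node == surviving_master(views) else "panic"
```

For the scenario in this thread, `action_for(1, [1, 2])` gives "continue" and `action_for(2, [1, 2])` gives "panic", which matches the panic seen on Node 2.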
> Aneesh Kumar wrote:
>
> > Hi,
> >
> > Today something 'strange' happened. I was
> > writing some code that informs me when nodes are added
> > to or removed from the cluster. To test it,
> > I ran the binary on one machine. It showed both
> > nodes up. Then I removed the network cable from node 2.
> > My monitoring program, which was running on node 1,
> > showed that node 2 had gone. Fine, happy. But then this
> > 'strange' thing happened: for node 2, it is node 1
> > that is gone, so it became the root node by itself.
> > When I plugged the network cable back in, node 2
> > gave me a kernel panic! I know it is an
> > expected behaviour, but how do we take care of
> > a network failure like the one above? If someone in a
> > cluster removes the cable from one of the machines, what
> > will happen?
> >
> > -aneesh
> >
> > _______________________________________________
> > ci-linux-devel mailing list
> > ci-linux-devel@opensource.compaq.com
> > http://www.opensource.compaq.com/mailman/listinfo/ci-linux-devel