[CI] problem when I removed the network cable from one node

Bruce Walker bruce@kahuna.cag.cpqcorp.net
Thu, 9 Aug 2001 15:07:55 -0700 (PDT)


All,
  Everything Kai-Min says below is correct.  The Split-Brain
avoidance code (SBA) utilized a serial line between the nodes;
before doing a takeover, the line was queried. In addition, there
are two other approaches to the problem:
   a: STOMITH - (Shoot The Other Man In The Head);  Sistina
      has code for this used in GFS that needs to be integrated.
   b: multiple interconnects;  in NSC we supported having more
      than one ethernet between nodes and failing over if one
      path failed;  haven't ported that part yet either.
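
To make the decision order concrete, here is a minimal sketch in Python
of how the serial-line query and multiple interconnects could gate a
takeover. It is only an illustration, not the NSC or CI code: the names
Action, decide, interconnects_up and serial_peer_alive are hypothetical,
and the HOLD outcome (what a node does when the network is cut but the
peer still answers on the serial line) is an assumption.

```python
from enum import Enum


class Action(Enum):
    STAY = "stay"            # peer still reachable over some network path
    HOLD = "hold"            # network cut, but peer alive: no takeover (assumed policy)
    TAKE_OVER = "take_over"  # peer unreachable on every channel


def decide(interconnects_up, serial_peer_alive):
    """Decide what to do when the peer stops answering on one path.

    interconnects_up  -- list of bools, one per ethernet path to the peer
    serial_peer_alive -- True if the peer answered a query on the serial line
    """
    # Multiple interconnects: any surviving path means no node-down event.
    if any(interconnects_up):
        return Action.STAY
    # All ethernet paths are gone; query the serial line before taking over.
    if serial_peer_alive:
        return Action.HOLD
    return Action.TAKE_OVER


if __name__ == "__main__":
    # Single cable pulled, peer still alive on the serial line.
    print(decide([False], True))
```

With a single ethernet path and no serial line (serial_peer_alive always
False), a pulled cable goes straight to TAKE_OVER on both nodes, which is
the split-brain case described below.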

bruce


> Aneesh,
>     What you've done is simulate a Split-Brain scenario.  Unfortunately,
> CI currently doesn't have any Split-Brain avoidance code.  After you
> disconnected Node 2's cable, Node 2 thought Node 1 went down and
> therefore failed over to become the CLMS master.  Node 1, on the other
> hand, thought Node 2 went down and processed a Nodedown event for it.
> At this point, you have two CLMS masters (also known as Split-Brain).
> When network connectivity is re-established, each node probes the other
> and realizes that its view of the CLMS master differs.  Without
> split-brain avoidance code, the algorithm currently favors the lower
> numbered node as the CLMS master.  Therefore, the panic you're seeing on
> Node 2 is the correct behavior.  The original UnixWare NonStop Clusters
> code had support for split-brain avoidance; however, this code has not
> yet been ported.  If anyone is interested in tackling this as a side
> project, I'd be happy to send you some of the code.
> 
> Kai-Min Sung
> CI/SSI-Linux Developer
> kai-min.sung@compaq.com
> 
> Aneesh Kumar wrote:
> 
> > Hi,
> >
> > Something strange happened today. I was writing some code to
> > notify me when nodes are added to or removed from the cluster.
> > To test it, I ran the binary on one machine, and it showed both
> > nodes up. Then I removed the network cable from node 2. My
> > monitoring program, running on node 1, showed that node 2 had
> > gone. Fine so far. But from node 2's point of view it was node 1
> > that had gone, so node 2 became the root node by itself. When I
> > plugged the network cable back in, node 2 gave me a kernel
> > panic. I know this is expected behaviour, but how do we handle a
> > network failure like this one? If someone removes the cable from
> > one of the machines in a cluster, what will happen?
> >
> >  -aneesh
> >
> > _______________________________________________
> > ci-linux-devel mailing list
> > ci-linux-devel@opensource.compaq.com
> > http://www.opensource.compaq.com/mailman/listinfo/ci-linux-devel
> 