This document describes how to enhance the Cluster Infrastructure package
to add your own cluster services.
Much of the CI code has been rewritten to allow CI components to be added
in a cleaner, more modular fashion. One major improvement is that you no
longer have to edit CI-specific files in order to hook in a new cluster
service. You can declare the CI components of your new cluster service
within your own .c files and call registration routines exported by
CI.
Although it hasn't yet been tested, the new architecture should allow CI
components to be written as kernel modules. There are, however, still many
issues that need to be addressed before one can dynamically load CI components
into a running cluster. One of these is the fact that CI components are
looked up by number, not by name. Therefore, the order in which CI components
are registered must be the same clusterwide. We guarantee this currently by
building CI components statically into the kernel and distributing the
same kernel across all nodes in the cluster. Furthermore, there are
synchronization/locking problems that need to be considered when introducing
a new cluster service into an already running cluster.
Here are the basic steps in creating a new cluster service:
1) Write the client/server code that will be used by your service.
2) Create a .svc file to define the messages/RPCs of your service.
(read the ICSGEN document for help and examples on doing this.)
3) Choose an ICS channel over which your cluster service will run.
You could run over an existing ICS channel, or register one of your
own.
4) Register your cluster service.
5) Most likely, your cluster service will want to run code for nodeup/nodedown
events. Register a CLMS subsystem for your service and pass in your own
nodeup/nodedown callback routines.
6) If you are writing a centralized cluster service (running on one node
at a time), you can register a CLMS key service and pass in your own
failover routines. CLMS will automatically choose a key server node
for your key service on bootup. In the case that the key server node
goes down, CLMS will automatically select a takeover node for the key
service and run the failover routines to recreate the state on the new
node.
7) Create a Makefile to compile your new cluster service.
Registering a new ICS Channel
==============================
1) Declare the new channel using type ics_chan_t (typedef int).
#include <cluster/ics.h>
ics_chan_t ics_test_chan;
2) Call register_ics_channel() registration routine. This registration
routine should be called before cluster initialization has begun. A
good way to hook in this call is to create a module initialization
routine for your CI component and use the module_init() macro. If
your module is compiled into the kernel, the initialization will be
called early in the startup code, do_basic_setup(). Otherwise,
if it is a loadable module, the kernel will run the initialization
routine once the module is loaded. There is currently no support
for dynamically loading CI components into a running cluster; however,
this is a goal we are aiming for in future releases.
static int __init test_init(void)
{
        return register_ics_channel(&ics_test_chan, -1);
}
module_init(test_init);
register_ics_channel(ics_chan_t *ics_chan,
                     int req_num);
- ics_chan is a pointer to the ics channel identifier. All the ics channel
identifiers are entered into the global array ics_channels[].
- req_num allows you to request a specific channel number for this channel.
Passing the value -1 dynamically assigns the next available ics channel
number (starting from 0). Currently, the ics channel numbers must match
across all nodes in the cluster; however, this is a requirement we wish
to remove in the future.
Registering a new Cluster Service
==================================
1) Declare the new cluster service using type cluster_svc_t (typedef int).
#include <cluster/ics.h>
cluster_svc_t cluster_test_svc;
2) Call register_cluster_svc() registration routine. This registration
routine should be called before cluster initialization has begun. See
the example above for using the module_init() macro to hook in this call.
register_cluster_svc(cluster_svc_t *cluster_svc,
                     ics_chan_t chan_num,
                     int (*ics_svc_init)(void));
- cluster_svc is a pointer to the cluster service identifier.
- chan_num is the ICS channel this service will be a subservice of.
- ics_svc_init is a pointer to the cluster service initialization routine.
This initialization routine registers the cluster service client/server
stubs with the ICS subsystem. The initialization routine is generated by
icsgen in the file icssvr_<service>_tables_gen.c. This routine is named
ics_<service>_svc_init(), where <service> is the name of your .svc
file. For example, if your service is called "test" and you name
your service definitions file test.svc, the generated initialization
routine will be named ics_test_svc_init() and reside in the generated file
icssvr_test_tables_gen.c.
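Continuing the "test" example, a minimal registration might look like the
following sketch. It assumes the channel ics_test_chan from the previous
section and the icsgen-generated routine ics_test_svc_init(); whether
register_cluster_svc() returns a status value is not specified here, so its
return value is ignored.
#include <cluster/ics.h>

extern int ics_test_svc_init(void);    /* generated by icsgen from test.svc */

ics_chan_t ics_test_chan;
cluster_svc_t cluster_test_svc;

static int __init test_init(void)
{
        int error;

        /* Register the channel first so ics_test_chan holds a valid number. */
        error = register_ics_channel(&ics_test_chan, -1);
        if (error)
                return error;

        /* Make the "test" service a subservice of that channel. */
        register_cluster_svc(&cluster_test_svc, ics_test_chan,
                             ics_test_svc_init);
        return 0;
}
module_init(test_init);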
Registering a new CLMS Subsystem
=================================
1) Call register_clms_subsys() registration routine. This registration
routine should be called before cluster initialization has begun. See
the example above for using the module_init() macro to hook in this call.
register_clms_subsys(char *s_name,
                     int pri,
                     int (*subsys_nodeup)(void *, clms_subsys_t,
                                          clusternode_t, clusternode_t, void *),
                     int (*subsys_nodedown)(void *, clms_subsys_t,
                                            clusternode_t, clusternode_t, void *),
                     void (*subsys_init_prejoin)(void),
                     void (*subsys_init_postjoin)(void),
                     void (*subsys_init_postroot)(void))
- s_name is the string name for the CLMS subsystem.
- pri is the CLMS priority band number. This priority number determines the
order in which the subsystem's nodeup and nodedown callback routines are run.
During a NODEUP/NODEDOWN event, CLMS cycles through each priority band in
order (starting from the highest priority band 0) and calls back each
subsystem's nodeup/nodedown routine. CLMS waits for every callback routine
in a particular priority band to finish before launching the callback
routines of the next priority band. Subsystems in the same priority band
may run their callback routines in parallel; however, subsystems in
different priority bands are always serialized.
If your subsystem does not require ordering with respect to other
subsystems, there is a special priority band -1 you can use. Subsystems
in this priority band are independent of all other subsystems. Unlike
the other priority bands, registered routines in band -1 do not have to
complete before the next priority band is launched. CLMS starts the
registered routines in band -1 and immediately continues processing the
normal priority bands, beginning with band 0. The registered routines in
band -1 therefore run in parallel with all the other priority bands;
CLMS does not wait for their completion until the last priority band has
finished its processing.
- subsys_nodeup()/subsys_nodedown() are the subsystem's registered callback
routines. These routines are called by CLMS when a NODEUP/NODEDOWN
cluster transition occurs. CLMS allows multiple NODEDOWN events to happen
in parallel but serializes NODEUP events. At all times, CLMS
maintains a consistent clusterwide view of node membership. Both NODEUP
and NODEDOWN events are driven by the CLMS master node.
The subsystem nodedown routines are called after ICS communications have
been torn down with the down node. The down node is in the KCLEANUP state
while CLMS calls the subsystem nodedown routines. The node stays in the
KCLEANUP state until all nodes in the cluster have finished processing
the subsystem nodedown routines. Once nodedown processing has finished
clusterwide, CLMS transitions the down node to the UCLEANUP state. Each
NODEDOWN event is handled in a separate thread, so it is possible for
subsystem nodedown routines to run in parallel for different NODEDOWN
events. There is a time limit on how long a nodedown routine is allowed
to run on a particular node. Once this limit is exceeded, the node panics.
This limit is defined as CLMS_NDTO_PANIC_SECS in the file
cluster/clms/clms_conf.c. The default is set to 300 seconds (5 minutes).
The subsystem nodeup routines are called in parallel on all existing
cluster nodes when a new node joins. Only one node is allowed to join
the cluster at a time. The subsystem nodeup routines are called after
all the ICS communication channels have been set up with the new node and
it is able to exchange cluster service messages/RPCs.
Note that these subsystem callback routines are called only for cluster
transitions of OTHER nodes in the cluster. For example, a node that is
joining the cluster does not call subsys_nodeup for its own NODEUP
transition. However, as it learns about the existing nodes in the
cluster, the new node will call subsys_nodeup for each of those nodes
as part of their NODEUP transitions.
Each registered callback routine must return a value of 0 and must issue
a callback to CLMS once it has completed processing (see the sketch at the
end of this section).
The CLMS callback functions are as follows:
    Function                    Callback Function
    subsys_nodeup               clms_nodeup_callback()
    subsys_nodedown             clms_nodedown_callback()
Each CLMS callback function is called with the parameters (void *clms_handle,
int service, clusternode_t node). These three parameters are passed in by
CLMS to the subsystem's registered callback routines.
It is possible within the subsystem callback routines to spawn a separate
thread to handle the NODEUP/NODEDOWN processing. This allows for parallel
execution with other subsystem callback functions in the same priority band.
In this case, the subsys_nodeup()/subsys_nodedown() routine still returns
0, and the spawned thread issues the callback to CLMS.
Although it may seem that the last parameter (void *private) could be
used to store private data for your subsystem, CLMS always passes in
NULL for it, so you can ignore it for now.
- There are 3 hooks in the cluster_main_init_*() startup routines where
initialization for clms subsystems can be called.
The first hook, passed in as subsys_init_prejoin(), is called prior to
the clms_join_cluster() call. At this point in the bootup phase, the node
has only set up minimal communication channels with the CLMS master node.
The node finds out about all existing key services in the cluster and
offers to provide any key services that it is capable of serving. One
CANNOT assume at this point that the key service resources are available
for use. For an SSI cluster, the root filesystem has not yet been
mounted. Typically, this hook is used to initialize the CLMS subsystem's
private data structures.
The second hook, passed in as subsys_init_postjoin(), is called immediately
after the clms_join_cluster() call. At this point, the node has established
all communication channels with the CLMS master node and the key server
nodes in the cluster, and it is able to send/receive cluster service
messages/RPCs with this limited set of nodes. For an SSI cluster, the
root filesystem has still not been mounted.
The third hook, passed in as subsys_init_postroot(), is only relevant
for an SSI cluster. This hook is called after the root filesystem has
been mounted. It is provided to initialize subsystems that are dependent
on the root filesystem (e.g. the Cluster File System).
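Putting the pieces above together, here is a minimal sketch of a subsystem
registration for the "test" service. The callback parameter names, the
meaning of the second clusternode_t argument, and the choice of priority
band are assumptions for illustration; only the first three callback
parameters and the clms_*_callback() routines are described above.
#include <cluster/clms.h>

static int
test_nodeup(void *clms_handle, clms_subsys_t service, clusternode_t node,
            clusternode_t other, void *private)
{
        /* ... bring per-node state for "node" online ... */

        /* Tell CLMS this subsystem is done with its NODEUP processing. */
        clms_nodeup_callback(clms_handle, service, node);
        return 0;
}

static int
test_nodedown(void *clms_handle, clms_subsys_t service, clusternode_t node,
              clusternode_t other, void *private)
{
        /* ... tear down state associated with "node" ... */

        clms_nodedown_callback(clms_handle, service, node);
        return 0;
}

static void test_prejoin(void)
{
        /* initialize private data structures; no messages/RPCs yet */
}

static void test_postjoin(void)
{
        /* messages/RPCs to the CLMS master and key server nodes are now OK */
}

static void test_postroot(void)
{
        /* SSI only: the root filesystem is now mounted */
}

static int __init test_clms_init(void)
{
        /* Priority band -1: no ordering dependency on other subsystems. */
        register_clms_subsys("test", -1, test_nodeup, test_nodedown,
                             test_prejoin, test_postjoin, test_postroot);
        return 0;
}
module_init(test_clms_init);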
Registering a new CLMS Key Service
===================================
1) Declare the new CLMS key service using type clms_key_svc_t (typedef int).
#include <cluster/clms.h>
clms_key_svc_t test_key_service;
2) Call register_clms_key_service() registration routine. This registration
routine should be called before cluster initialization has begun. See
the example above for using the module_init() macro to hook in this call.
register_clms_key_service(char *name,
                          int *service_num,
                          int crit_flag,
                          void (*fail)(nsc_nodelist_t *),
                          int (*pull_data)(char *, int, int *),
                          void (*failover_data)(clusternode_t, char *, int),
                          void (*server_ready)(clusternode_t),
                          void (*init_prejoin)(void),
                          void (*init_postjoin)(void),
                          void (*init_postroot)(void))
- name is the string name for the CLMS key service.
- service_num is a pointer to the CLMS key service identifier.
- crit_flag specifies whether or not this key service is critical. A cluster
cannot function without a server for all its critical key services. This
means the cluster will hang on bootup until it has found a server for all
the critical key services. Similarly, the cluster will panic if a key
server node goes down and no other node can take over the service. For non-critical
key services, the CLMS Master node will act as the "floater" node if no
primary/secondary nodes are found for that key service.
- failover function
void (*fail)(nsc_nodelist_t *)
This function is run by the failover node after the primary node for
this key service has gone down. It is the entry point for processing
the key service failover. The pull_data and failover_data routines are
optional helper functions in the failover processing.
Each key service failover is launched in its own thread, so they can run
in parallel. CLMS does not wait for the completion of key service failover
processing. It is the responsibility of the key service failover routine
to notify the cluster it is finished processing the failover and ready for
service again. It does this by calling the function
clms_set_key_service_ready() at the end of its processing.
The nsc_nodelist_t parameter is automatically passed in for you and
contains a list of fully-up/coming-up nodes in the cluster at the time
of the nodedown. It is your responsibility to kfree() this structure
before you exit the function.
Note that CLMS key service failover is started up before calling back
the CLMS subsystem nodedown routines. However, there aren't any
dependencies between the two.
- pull data function
int (*pull_data)(char *, int, int *);
This function is optional. It is used by the key service during
failover to pull state information from the surviving nodes in the
cluster. If you would like to leverage this functionality, you can
call clms_key_svc_pull_data() from within your failover function and
pass it a nsc_nodelist_t. This will in turn send an RPC to each node
in the nsc_nodelist_t and call your key service pull data function
remotely on those nodes.
The first parameter is a buffer to store the pulled data. The second
parameter specifies the length of this buffer and the third parameter
is used to return the actual size required to store all the pulled data.
If the passed-in buffer length turns out to be too small, store the
required size in the third parameter and return the error E2BIG.
clms_key_svc_pull_data() will then automatically resend the RPC with
a buffer of the specified size.
- failover data function
void (*failover_data)(clusternode_t, char *, int);
This function is also optional but will be required if you decide
to leverage the pull data functionality described above. This routine
is run on the failover node to process the data pulled from each
surviving node in the cluster.
This function is called from clms_key_svc_pull_data().
The first parameter contains the number of the node whose pulled data
you're currently processing. The second and third parameters contain
the pointer to the data buffer and its length, respectively. (A combined
sketch of the failover, pull data, and failover data routines follows
this parameter list.)
- server ready function
void (*server_ready)(clusternode_t);
This function is called by clms_set_key_service_ready() and is
run on each node in the cluster to notify it that the key service
is now in the READY state. CLMS will pass in the designated
server node for this key service as the function parameter.
It is a good idea to maintain a global clusternode_t to identify
the key server node for your key service. As part of the server_ready()
function, you should assign the designated server node to this global.
void test_server_ready(clusternode_t node)
{
        test_node = node;
}
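Before moving on to the initialization function, here is a minimal sketch
tying together the failover, pull data, and failover data routines described
above. The buffer layout and the test_* names are illustrative assumptions,
and the exact argument list of clms_key_svc_pull_data() is not spelled out
here beyond the nodelist it is passed.
#include <linux/errno.h>
#include <linux/slab.h>
#include <cluster/clms.h>

extern clms_key_svc_t test_key_service;

/* Failover entry point; runs in its own thread on the takeover node. */
void test_key_failover(nsc_nodelist_t *nodelist)
{
        /*
         * Pull state from the surviving nodes.  This sends an RPC that
         * runs test_pull_data() remotely and hands each reply to
         * test_failover_data() on this node.
         */
        clms_key_svc_pull_data(nodelist);

        /* Tell the cluster the key service is ready for use again. */
        clms_set_key_service_ready(test_key_service);

        /* The nodelist was allocated for us; we must free it. */
        kfree(nodelist);
}

/* Runs remotely on each surviving node to package up its local state. */
int test_pull_data(char *buf, int buflen, int *sizep)
{
        int needed = 0;         /* placeholder: size of this node's state */

        if (needed > buflen) {
                *sizep = needed;
                return E2BIG;   /* caller retries with a larger buffer */
        }
        /* ... copy this node's state into buf ... */
        *sizep = needed;
        return 0;
}

/* Runs on the failover node once for each node's pulled data. */
void test_failover_data(clusternode_t node, char *buf, int len)
{
        /* ... recreate the state that "node" reported (len bytes in buf) ... */
}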
3) You will need to create an initialization function for your key service
which is run early at cluster boot time on each node.
This function should call clms_get_key_server_node() with the wait flag
to determine which node in the cluster is serving as the primary for
this key service. Once you've determined the key service node in the
cluster, you should test whether or not you're the designated primary.
If so, then call clms_set_key_service_ready() to set yourself to the
READY state and notify everyone in the cluster. Otherwise, call
clms_set_key_secondary_ready() to let the cluster know that you're
available as a secondary (failover) node for this key service. If your
configuration doesn't designate you as a secondary, this function is
a NO-OP. Finally, call clms_waitfor_key_service() to wait until the
key service has been set to the READY state.
void test_key_service_init(void)
{
        test_node = clms_get_key_server_node(test_key_service, 1);
        if (test_node == this_node) {
                clms_set_key_service_ready(test_key_service);
        } else {
                clms_set_key_secondary_ready(test_key_service);
                clms_waitfor_key_service(test_key_service);
        }
}
The init_prejoin, init_postjoin, and init_postroot hooks are called
at the same places as described in the section on CLMS subsystems.
If your key service initialization function doesn't require sending
messages/RPCs, hooking it into init_prejoin() should work fine.
Otherwise, hook the initialization function into init_postjoin(),
after communication paths have been set up with the key server nodes.
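Finally, here is a sketch of what the registration itself might look like,
using the example routines from this section. Whether
register_clms_key_service() returns a status value is not specified here,
and the crit_flag value 0 is assumed to mean non-critical.
static int __init test_key_init(void)
{
        register_clms_key_service("test", &test_key_service,
                                  0,                      /* crit_flag */
                                  test_key_failover,      /* fail */
                                  test_pull_data,         /* pull_data */
                                  test_failover_data,     /* failover_data */
                                  test_server_ready,      /* server_ready */
                                  NULL,                   /* init_prejoin */
                                  test_key_service_init,  /* init_postjoin */
                                  NULL);                  /* init_postroot */
        return 0;
}
module_init(test_key_init);
test_key_service_init() is hooked in as init_postjoin here because it waits
for the key server node, following the guidance above.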
Writing the Makefile
=====================
The Makefile for your new cluster service will include targets for a
merged .o file (O_TARGET), a list of individual file objects (obj-y), and
a list of your service definitions files (SVCFILES). You will also
need to include the cluster/Rules.make file to process the service definitions
files. Here's an example Makefile for a cluster service named test:
O_TARGET := test.o
obj-y := \
        test_init.o \
        test_client.o \
        test_server.o \
        test_misc.o
SVCFILES := \
        test.svc
include $(CLUSTERDIR)/Rules.make