This document describes how to enhance the Cluster Infrastructure package
to add your own cluster services.
Much of the CI code has been rewritten to allow CI components to be added
in a cleaner, more modular fashion. One major improvement is that you no
longer have to edit CI-specific files in order to hook in a new cluster
service. You can declare the CI components of your new cluster service
within your own .c files and call registration routines exported by
CI.
Although it hasn't yet been tested, the new architecture should allow CI
components to be written as kernel modules. There are, however, still many
issues that need to be addressed before one can dynamically load CI components
into a running cluster. One of these is the fact that CI components are
looked up by number, not by name. Therefore, the order in which CI components
are registered must be the same clusterwide. We guarantee this currently by
building CI components statically into the kernel and distributing the
same kernel across all nodes in the cluster. Furthermore, there are
synchronization/locking problems that need to be considered when introducing
a new cluster service into an already running cluster.
Here are the basic steps in creating a new cluster service:
1) Write the client/server code that will be used by your service.
2) Create a .svc file to define the messages/RPCs of your service.
(read the ICSGEN document for help and examples on doing this.)
3) Choose an ICS channel over which your cluster service will run.
You could run over an existing ICS channel, or register one of your
own.
4) Register your cluster service.
5) Most likely, your cluster service will want to run code for nodeup/nodedown
events. Register a CLMS subsystem for your service and pass in your own
nodeup/nodedown callback routines.
6) If you are writing a centralized cluster service (running on one node
at a time), you can register a CLMS key service and pass in your own
failover routines. CLMS will automatically choose a key server node
for your key service on bootup. In the case that the key server node
goes down, CLMS will automatically select a takeover node for the key
service and run the failover routines to recreate the state on the new
node.
7) Create a Makefile to compile your new cluster service.
Registering a new ICS Channel
==============================
1) Declare the new channel using type ics_chan_t (typedef int).
#include <cluster/ics.h>
ics_chan_t ics_test_chan;
2) Call register_ics_channel() registration routine. This registration
routine should be called before cluster initialization has begun. A
good way to hook in this call is to create a module initialization
routine for your CI component and use the module_init() macro. If
your module is compiled into the kernel, the initialization will be
called early in the startup code, do_basic_setup(). Otherwise,
if it is a loadable module, the kernel will run the initialization
routine once the module is loaded. There is currently no support
for dynamically loading CI components into a running cluster; however,
this is a goal we are aiming for in future releases.
static int __init test_init(void)
{
        return register_ics_channel(&ics_test_chan, -1);
}
module_init(test_init);
register_ics_channel(ics_chan_t *ics_chan,
                     int req_num);
- ics_chan is a pointer to the ics channel identifier. All the ics channel
identifiers are entered into the global array ics_channels[].
- req_num allows you to request a specific channel number for this channel.
Passing the value -1 dynamically assigns the next available ics channel
number (starting from 0). Currently, the ics channel numbers must match
across all nodes in the cluster; however, this is a requirement we wish
to remove in the future.
Registering a new Cluster Service
==================================
1) Declare the new cluster service using type cluster_svc_t (typedef int).
#include <cluster/ics.h>
cluster_svc_t cluster_test_svc;
2) Call register_cluster_svc() registration routine. This registration
routine should be called before cluster initialization has begun. See
the example above for using the module_init() macro to hook in this call.
register_cluster_svc(cluster_svc_t *cluster_svc,
                     ics_chan_t chan_num,
                     int (*ics_svc_init)(void));
- cluster_svc is a pointer to the cluster service identifier.
- chan_num is the ICS channel this service will be a subservice of.
- ics_svc_init is a pointer to the cluster service initialization routine.
This initialization routine registers the cluster service client/server
stubs with the ICS subsystem. The initialization routine is generated by
icsgen in the file icssvr_<service>_tables_gen.c. This routine is named
ics_<service>_svc_init(), where <service> is the name of your .svc
file. For example, if your service is called "test" and you name
your service definitions file test.svc, the generated initialization
routine will be named ics_test_svc_init() and reside in the generated file
icssvr_test_tables_gen.c.
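Continuing the "test" example, a minimal registration might look like the
following sketch. It assumes the channel ics_test_chan from the previous
section and the icsgen-generated routine ics_test_svc_init(); whether
register_cluster_svc() returns a status value is not specified here, so its
return value is ignored.
#include <cluster/ics.h>

extern int ics_test_svc_init(void);    /* generated by icsgen from test.svc */

ics_chan_t ics_test_chan;
cluster_svc_t cluster_test_svc;

static int __init test_init(void)
{
        int error;

        /* Register the channel first so ics_test_chan holds a valid number. */
        error = register_ics_channel(&ics_test_chan, -1);
        if (error)
                return error;

        /* Make the "test" service a subservice of that channel. */
        register_cluster_svc(&cluster_test_svc, ics_test_chan,
                             ics_test_svc_init);
        return 0;
}
module_init(test_init);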
Registering a new CLMS Subsystem
=================================
1) Call register_clms_subsys() registration routine. This registration
routine should be called before cluster initialization has begun. See
the example above for using the module_init() macro to hook in this call.
register_clms_subsys(char *s_name,
                     int pri,
                     int (*subsys_nodeup)(void *, clms_subsys_t,
                                          clusternode_t, clusternode_t, void *),
                     int (*subsys_nodedown)(void *, clms_subsys_t,
                                            clusternode_t, clusternode_t, void *),
                     void (*subsys_init_prejoin)(void),
                     void (*subsys_init_postjoin)(void),
                     void (*subsys_init_postroot)(void))
- s_name is the string name for the CLMS subsystem.
- pri is the CLMS priority band number. This priority number determines the
order in which the subsystem's nodeup and nodedown callback routines are run.
During a NODEUP/NODEDOWN event, CLMS cycles through each priority band in
order (starting from the highest priority band 0) and calls back each
subsystem's nodeup/nodedown routine. CLMS waits for every callback routine
in a particular priority band to finish before launching the callback
routines of the next priority band. Subsystems in the same priority band
may run their callback routines in parallel; however, subsystems in
different priority bands are always serialized.
If your subsystem does not require ordering with respect to other
subsystems, there is a special priority band -1 you can use. Subsystems
in this priority band are independent of all other subsystems. Unlike
the other priority bands, registered routines in band -1 do not have to
complete before the next priority band is launched. CLMS starts the
registered routines in band -1 and immediately continues processing the
normal priority bands, beginning with band 0. The registered routines in
band -1 therefore run in parallel with all the other priority bands;
CLMS does not wait for their completion until the last priority band has
finished its processing.
- subsys_nodeup()/subsys_nodedown() are the subsystem's registered callback
routines. These routines are called by CLMS when a NODEUP/NODEDOWN
cluster transition occurs. CLMS allows multiple NODEDOWN events to happen
in parallel but serializes NODEUP events. At all times, CLMS
maintains a consistent clusterwide view of node membership. Both NODEUP
and NODEDOWN events are driven by the CLMS master node.
The subsystem nodedown routines are called after ICS communications have
been torn down with the down node. The down node is in the KCLEANUP state
while CLMS calls the subsystem nodedown routines. The node stays in the
KCLEANUP state until all nodes in the cluster have finished processing
the subsystem nodedown routines. Once nodedown processing has finished
clusterwide, CLMS transitions the down node to the UCLEANUP state. Each
NODEDOWN event is handled in a separate thread, so it is possible for
subsystem nodedown routines to run in parallel for different NODEDOWN
events. There is a time limit on how long a nodedown routine is allowed
to run on a particular node. Once this limit is exceeded, the node panics.
This limit is defined as CLMS_NDTO_PANIC_SECS in the file
cluster/clms/clms_conf.c. The default is set to 300 seconds (5 minutes).
The subsystem nodeup routines are called in parallel on all existing
cluster nodes when a new node joins. Only one node is allowed to join
the cluster at a time. The subsystem nodeup routines are called after
all the ICS communication channels have been set up with the new node and
it is able to exchange cluster service messages/RPCs.
Note that these subsystem callback routines are called only for cluster
transitions of OTHER nodes in the cluster. For example, a node that is
joining the cluster does not call subsys_nodeup for its own NODEUP
transition. However, as it learns about the existing nodes in the
cluster, the new node will call subsys_nodeup for each of those nodes
as part of their NODEUP transitions.
Each registered callback routine must return a value of 0 and must issue
a callback to CLMS once it has completed processing (see the sketch at the
end of this section).
The CLMS callback functions are as follows:
    Function                    Callback Function
    subsys_nodeup               clms_nodeup_callback()
    subsys_nodedown             clms_nodedown_callback()
Each CLMS callback function is called with the parameters (void *clms_handle,
int service, clusternode_t node). These three parameters are passed in by
CLMS to the subsystem's registered callback routines.
It is possible within the subsystem callback routines to spawn a separate
thread to handle the NODEUP/NODEDOWN processing. This allows for parallel
execution with other subsystem callback functions in the same priority band.
In this case, the subsys_nodeup()/subsys_nodedown() routine still returns
0, and the spawned thread issues the callback to CLMS.
Although it may seem that the last parameter (void *private) could be
used to store private data for your subsystem, CLMS always passes in
NULL for it, so you can ignore it for now.
- There are 3 hooks in the cluster_main_init_*() startup routines where
initialization for clms subsystems can be called.
The first hook, passed in as subsys_init_prejoin(), is called prior to
the clms_join_cluster() call. At this point in the bootup phase, the node
has only set up minimal communication channels with the CLMS master node.
The node finds out about all existing key services in the cluster and
offers to provide any key services that it is capable of serving. One
CANNOT assume at this point that the key service resources are available
for use. For an SSI cluster, the root filesystem has not yet been
mounted. Typically, this hook is used to initialize the CLMS subsystem's
private data structures.
The second hook, passed in as subsys_init_postjoin(), is called immediately
after the clms_join_cluster() call. At this point, the node has established
all communication channels with the CLMS master node and the key server
nodes in the cluster, and it is able to send/receive cluster service
messages/RPCs with this limited set of nodes. For an SSI cluster, the
root filesystem has still not been mounted.
The third hook, passed in as subsys_init_postroot(), is only relevant
for an SSI cluster. This hook is called after the root filesystem has
been mounted. It is provided to initialize subsystems that are dependent
on the root filesystem (e.g. the Cluster File System).
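Putting the pieces above together, here is a minimal sketch of a subsystem
registration for the "test" service. The callback parameter names, the
meaning of the second clusternode_t argument, and the choice of priority
band are assumptions for illustration; only the first three callback
parameters and the clms_*_callback() routines are described above.
#include <cluster/clms.h>

static int
test_nodeup(void *clms_handle, clms_subsys_t service, clusternode_t node,
            clusternode_t other, void *private)
{
        /* ... bring per-node state for "node" online ... */

        /* Tell CLMS this subsystem is done with its NODEUP processing. */
        clms_nodeup_callback(clms_handle, service, node);
        return 0;
}

static int
test_nodedown(void *clms_handle, clms_subsys_t service, clusternode_t node,
              clusternode_t other, void *private)
{
        /* ... tear down state associated with "node" ... */

        clms_nodedown_callback(clms_handle, service, node);
        return 0;
}

static void test_prejoin(void)
{
        /* initialize private data structures; no messages/RPCs yet */
}

static void test_postjoin(void)
{
        /* messages/RPCs to the CLMS master and key server nodes are now OK */
}

static void test_postroot(void)
{
        /* SSI only: the root filesystem is now mounted */
}

static int __init test_clms_init(void)
{
        /* Priority band -1: no ordering dependency on other subsystems. */
        register_clms_subsys("test", -1, test_nodeup, test_nodedown,
                             test_prejoin, test_postjoin, test_postroot);
        return 0;
}
module_init(test_clms_init);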
Registering a new CLMS Key Service
===================================
1) Declare the new CLMS key service using type clms_key_svc_t (typedef int).
#include <cluster/clms.h>
clms_key_svc_t test_key_service;
2) Call register_clms_key_service() registration routine. This registration
routine should be called before cluster initialization has begun. See
the example above for using the module_init() macro to hook in this call.
register_clms_key_service(char *name,
                          int *service_num,
                          int crit_flag,
                          void (*fail)(nsc_nodelist_t *),
                          int (*pull_data)(char *, int, int *),
                          void (*failover_data)(clusternode_t, char *, int),
                          void (*server_ready)(clusternode_t),
                          void (*init_prejoin)(void),
                          void (*init_postjoin)(void),
                          void (*init_postroot)(void))
- name is the string name for the CLMS key service.
- service_num is a pointer to the CLMS key service identifier.
- crit_flag specifies whether or not this key service is critical. A cluster
cannot function without a server for all its critical key services. This
means the cluster will hang on bootup until it has found a server for all
the critical key services. Similarly, the cluster will panic if a key
server node goes down and no other node can take over the service. For non-critical
key services, the CLMS Master node will act as the "floater" node if no
primary/secondary nodes are found for that key service.
- failover function
void (*fail)(nsc_nodelist_t *)
This function is run by the failover node after the primary node for
this key service has gone down. It is the entry point for processing
the key service failover. The pull_data and failover_data routines are
optional helper functions in the failover processing.
Each key service failover is launched in its own thread, so they can run
in parallel. CLMS does not wait for the completion of key service failover
processing. It is the responsibility of the key service failover routine
to notify the cluster it is finished processing the failover and ready for
service again. It does this by calling the function
clms_set_key_service_ready() at the end of its processing.
The nsc_nodelist_t parameter is automatically passed in for you and
contains a list of fully-up/coming-up nodes in the cluster at the time
of the nodedown. It is your responsibility to kfree() this structure
before you exit the function.
Note that CLMS key service failover is started up before calling back
the CLMS subsystem nodedown routines. However, there aren't any
dependencies between the two.
- pull data function
int (*pull_data)(char *, int, int *);
This function is optional. It is used by the key service during
failover to pull state information from the surviving nodes in the
cluster. If you would like to leverage this functionality, you can
call clms_key_svc_pull_data() from within your failover function and
pass it a nsc_nodelist_t. This will in turn send an RPC to each node
in the nsc_nodelist_t and call your key service pull data function
remotely on those nodes.
The first parameter is a buffer to store the pulled data. The second
parameter specifies the length of this buffer and the third parameter
is used to return the actual size required to store all the pulled data.
If the passed-in buffer length turns out to be too small, store the
required size in the third parameter and return the error E2BIG.
clms_key_svc_pull_data() will then automatically resend the RPC with
a buffer of the specified size.
- failover data function
void (*failover_data)(clusternode_t, char *, int);
This function is also optional but will be required if you decide
to leverage the pull data functionality described above. This routine
is run on the failover node to process the data pulled from each
surviving node in the cluster.
This function is called from clms_key_svc_pull_data().
The first parameter contains the number of the node whose pulled data
you're currently processing. The second and third parameters contain
the pointer to the data buffer and its length, respectively. (A combined
sketch of the failover, pull data, and failover data routines follows
this parameter list.)
- server ready function
void (*server_ready)(clusternode_t);
This function is called by clms_set_key_service_ready() and is
run on each node in the cluster to notify it that the key service
is now in the READY state. CLMS will pass in the designated
server node for this key service as the function parameter.
It is a good idea to maintain a global clusternode_t to identify
the key server node for your key service. As part of the server_ready()
function, you should assign the designated server node to this global.
void test_server_ready(clusternode_t node)
{
        test_node = node;
}
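Before moving on to the initialization function, here is a minimal sketch
tying together the failover, pull data, and failover data routines described
above. The buffer layout and the test_* names are illustrative assumptions,
and the exact argument list of clms_key_svc_pull_data() is not spelled out
here beyond the nodelist it is passed.
#include <linux/errno.h>
#include <linux/slab.h>
#include <cluster/clms.h>

extern clms_key_svc_t test_key_service;

/* Failover entry point; runs in its own thread on the takeover node. */
void test_key_failover(nsc_nodelist_t *nodelist)
{
        /*
         * Pull state from the surviving nodes.  This sends an RPC that
         * runs test_pull_data() remotely and hands each reply to
         * test_failover_data() on this node.
         */
        clms_key_svc_pull_data(nodelist);

        /* Tell the cluster the key service is ready for use again. */
        clms_set_key_service_ready(test_key_service);

        /* The nodelist was allocated for us; we must free it. */
        kfree(nodelist);
}

/* Runs remotely on each surviving node to package up its local state. */
int test_pull_data(char *buf, int buflen, int *sizep)
{
        int needed = 0;         /* placeholder: size of this node's state */

        if (needed > buflen) {
                *sizep = needed;
                return E2BIG;   /* caller retries with a larger buffer */
        }
        /* ... copy this node's state into buf ... */
        *sizep = needed;
        return 0;
}

/* Runs on the failover node once for each node's pulled data. */
void test_failover_data(clusternode_t node, char *buf, int len)
{
        /* ... recreate the state that "node" reported (len bytes in buf) ... */
}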
3) You will need to create an initialization function for your key service
which is run early at cluster boot time on each node.
This function should call clms_get_key_server_node() with the wait flag
to determine which node in the cluster is serving as the primary for
this key service. Once you've determined the key service node in the
cluster, you should test whether or not you're the designated primary.
If so, then call clms_set_key_service_ready() to set yourself to the
READY state and notify everyone in the cluster. Otherwise, call
clms_set_key_secondary_ready() to let the cluster know that you're
available as a secondary (failover) node for this key service. If your
configuration doesn't designate you as a secondary, this function is
a NO-OP. Finally, call clms_waitfor_key_service() to wait until the
key service has been set to the READY state.
void test_key_service_init(void)
{
        test_node = clms_get_key_server_node(test_key_service, 1);
        if (test_node == this_node) {
                clms_set_key_service_ready(test_key_service);
        } else {
                clms_set_key_secondary_ready(test_key_service);
                clms_waitfor_key_service(test_key_service);
        }
}
The init_prejoin, init_postjoin, and init_postroot hooks are called
at the same places as described in the section on CLMS subsystems.
If your key service initialization function doesn't require sending
messages/RPCs, hooking it into init_prejoin() should work fine.
Otherwise, hook the initialization function into init_postjoin(),
after communication paths have been set up with the key server nodes.
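Finally, here is a sketch of what the registration itself might look like,
using the example routines from this section. Whether
register_clms_key_service() returns a status value is not specified here,
and the crit_flag value 0 is assumed to mean non-critical.
static int __init test_key_init(void)
{
        register_clms_key_service("test", &test_key_service,
                                  0,                      /* crit_flag */
                                  test_key_failover,      /* fail */
                                  test_pull_data,         /* pull_data */
                                  test_failover_data,     /* failover_data */
                                  test_server_ready,      /* server_ready */
                                  NULL,                   /* init_prejoin */
                                  test_key_service_init,  /* init_postjoin */
                                  NULL);                  /* init_postroot */
        return 0;
}
module_init(test_key_init);
test_key_service_init() is hooked in as init_postjoin here because it waits
for the key server node, following the guidance above.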
Writing the Makefile
=====================
The Makefile for your new cluster service will include targets for a
merged .o file (O_TARGET), a list of individual file objects (obj-y), and
a list of your service definitions files (SVCFILES). You will also
need to include the cluster/Rules.make file to process the service definitions
files. Here's an example Makefile for a cluster service named test:
O_TARGET := test.o
obj-y := \
        test_init.o \
        test_client.o \
        test_server.o \
        test_misc.o
SVCFILES := \
        test.svc
include $(CLUSTERDIR)/Rules.make