OPEN CI and OPEN SSI CLUSTER PROJECTS

1. ICS Messages and Responses
2. Generating ICS Code Using icsgen
    2.1. Services
    2.2. Operations
    2.3. Datatypes and Encoding
    2.4. Service Invocation Macros
    2.5. Summary Format of Generated Client-side Stubs
    2.6. Summary Format of Generated Server-side Stubs
    2.7. Generated Prototypes
    2.8. Generated Dispatch Table and Service Registration Routine
    2.9. Interface Examples
    2.10. icsgen
        2.10.1. Syntax
        2.10.2. Description
        2.10.3. Options
        2.10.4. Files
        2.10.5. Examples
    2.11. Source and Build Environment
    2.12. Defining Encoding and Decoding
    2.13. XDR-defined Datatypes
    2.14. nsc_rcall() Emulation
    2.15. Verification Tests
3. High-level ICS
    3.1. CI/SSI Services and ICS Communication Channels
    3.2. ICS Server Control Code
    3.3. Typical ICS Client Usage Scenarios
    3.4. Typical ICS Server Usage Scenarios
    3.5. ICS General Purpose Routines
        3.5.1. ics_svc_register()
        3.5.2. ics_init()
        3.5.3. ics_nodeinfo_callback()
        3.5.4. ics_geticsinfo()
        3.5.5. ics_seticsinfo()
        3.5.6. ics_nodeup()
        3.5.7. ics_nodedown()
        3.5.8. ics_getpriority()
        3.5.9. ics_setpriority()
    3.6. ICS Client-side Routines
        3.6.1. icscli_handle_get()
        3.6.2. icscli_wouldthrottle()
        3.6.3. icscli_waitforthrottle()
        3.6.4. icscli_send()
        3.6.5. icscli_wait_callback()
        3.6.6. icscli_wait()
        3.6.7. icscli_get_status()
        3.6.8. icscli_handle_release()
        3.6.9. icscli_handle_release_callback()
    3.7. ICS Client-side Argument Marshalling
        3.7.1. icscli_??code_type()
        3.7.2. icscli_??code*_data_t()
        3.7.3. icscli_??code*_uio_t()
        3.7.4. icscli_??code_mbuf_t()
        3.7.5. icscli_??code_mblk_t()
    3.8. ICS Server-side Routines
        3.8.1. icssvr_handle_get()
        3.8.2. icssvr_recv()
        3.8.3. icssvr_decode_done()
        3.8.4. icssvr_reply()
        3.8.5. icssvr_handle_release()
        3.8.6. icssvr_nodedown_svc_wait()
    3.9. ICS Server-side Argument Marshalling
        3.9.1. icssvr_??code_type()
        3.9.2. icssvr_??code_data_t()
        3.9.3. icssvr_??code_uio_t()
        3.9.4. icssvr_??code_mbuf_t()
        3.9.5. icssvr_??code_mblk_t()
4. Low-level ICS
    4.1. Low-level ICS and Communication Channels
    4.2. Low-level ICS Typical Client Scenarios
    4.3. Low-level ICS Typical Server Scenarios
    4.4. Low-level ICS General Purpose Upcall Routines
        4.4.1. ics_nodedown_notification()
        4.4.2. ics_nodeup_notification()
    4.5. Low-level ICS General Purpose Routines
        4.5.1. ics_llinit()
        4.5.2. ics_llgeticsinfo()
        4.5.3. ics_llseticsinfo()
        4.5.4. ics_llnodeup()
        4.5.5. ics_llnodedown()
    4.6. Low-level ICS Client-side Upcall Routines
        4.6.1. icscli_find_transid_handle()
        4.6.2. icscli_sendup_reply()
    4.7. Low-level ICS Client-side Routines
        4.7.1. icscli_llhandle_init()
        4.7.2. icscli_llwouldthrottle()
        4.7.3. icscli_llwaitfornothrottle()
        4.7.4. icscli_llsend()
        4.7.5. icscli_llhandle_deinit()
    4.8. Low-level ICS Client-side Argument Marshalling
    4.9. Low-level ICS Server-side Upcall Routines
        4.9.1. icssvr_find_recv_handle()
        4.9.2. icssvr_sendup_msg()
        4.9.3. icssvr_sendup_replydone()
    4.10. Low-level ICS Server-side Routines
        4.10.1. icssvr_llhandle_init()
        4.10.2. icssvr_llhandle_init_for_recv()
        4.10.3. icssvr_llhandles_present()
        4.10.4. icssvr_lldecode_done()
        4.10.5. icssvr_llreply()
        4.10.6. icssvr_llhandle_deinit()
    4.11. Low-level ICS Server-side Argument Marshalling Routines

SUBJECT:

AUTHOR:

VERSION:

DATE:

Internode Communication Subsystem (ICS) Design Specification

Now maintained by Compaq Computer Corp.

1.0

Originally 1996; See Errata From 10/15/01 for Updates

This document describes ICS, the CI/SSI Internode Communication Subsystem. ICS is used by all CI/SSI components for node-to-node communication.

The first section of this document provides a high-level overview of the message/response paradigm presented by ICS.

The second section of this document describes icsgen, the tool used to generate RPC and message client/server stubs.

The third section of this document describes the ICS interfaces (these interfaces are also known as high-level ICS interfaces, to contrast them with the low-level ICS interfaces described in the next section).

The final section of this document describes the low-level ICS interfaces. This is the transport-specific part of ICS (though the interfaces to low-level ICS are transport-independent). The low-level ICS interfaces are invoked only by high-level ICS.

1. ICS Messages and Responses

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- ICS_MAX_INLINE_DATA_SIZE is set to 300 bytes.

- There are a limit of 7 out-of-line buffers, where the buffers are

unlimited in size. "Out-of-line" is misleading, since TCP doesn't

offer true OOL data. It was a feature provided by a special

high-speed interconnect such as Servernet. Over TCP, OOL data is

sent out regularly like inline data. All OOL data is simply

appended to the end of the inline data buffer.

- DMA-push/DMA-pull is not supported. This was a feature when using

a high-speed interconnect such as Servernet. It is not available

over regular TCP/IP.

***********************************************************

The Internode Communication Subsystem (ICS) of CI/SSI has the following characteristics:

a client/server model is presented;
the high-level ICS client code is presented with either a message/response
model or a reliable message passing model;
the high-level ICS server code retrieves messages and sends optional responses back to the client.

This model can be used to efficiently represent both types of messaging paradigms: messages and RPC's (Remote Procedure Calls).

ICS messages and responses consist of:

a fixed size, in-line buffer (ICS_MAX_INLINE_DATA_SIZE bytes);
zero or more out-of-line buffers, where the buffers are unlimited in size.

The ICS interfaces are structured in a way to allow low-level ICS code to implement efficient transfer of out-of-line buffers using DMA-push, DMA-pull, or other techniques.

This chapter often refers to input arguments and output arguments . Input arguments refer to the data in messages to the server (both in-line data and out-of-line data). Input arguments apply to both RPCs and messages. Output arguments refer to the data in a response sent from a server (both in-line data and out-of-line data). Output arguments apply to RPCs only.

In order for client code to send a message (and wait for a response), or for the server code to handle a message from the client, the code must deal with handles. Handles are data structures which are opaque to the icsgen-generated stub code. Handles have the following characteristics:

contain the state of the transaction (the number of bytes in the message, whether the message has been sent, etc.);
contain the fixed size, in-line buffer referenced in prior paragraphs; contain information about any out-of-line buffers that are in use;

Many of routines in the interface make use of callback routines. Callback routines are routines that will be called by the low-level ICS code when some specific event has transpired. The parameters to the callback routine are specified in the relevant interface specifications. Of particular note is the fact that each callback routine has an argument specified as callarg. The callarg argument is an argument to the interface which is supplied solely for the purpose of being used as a callback argument.

2. Generating ICS Code Using icsgen

At the highest level, the interface between ICS and other CI/SSI components (using ICS) is a set of services with each service supporting a set of well-defined operations. These services - comprising both client and server elements - constitute the interface with the ICS subsystem. Services are defined using a definition language which is processed at build-time producing code to bind the local CI/SSI component to ICS; ICS client to ICS server; and thus local CI/SSI component to remote service.

Service definitions are contained in source files processed by a build process called icsgen. icsgen reads service definition files and generates further source files containing:

header files of prototype declarations for functions to be supplied to ICS - these prototypes ensure a compatible interface with ICS;
header files containing a macro interface calling ICS-supported services;
stub routines that map operations between layers of ICS and between nodes - these are principally routines mapping data representations between CI/SSI and internal formats used by ICS;
server dispatch tables used internally by ICS.

2.1. Services

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- mem_lwm and mem_hwm are currently not being used, since flow

control (aka throttling) is not supported.

***********************************************************

Each distinct service supported by ICS must be declared to ICS. Each service is defined by a file (or set of files) identifying the service to ICS and defining the operations (functions) provided by the service. Service definition files are conventionally given the name service_name.svc.

A service is defined to ICS by a service prefix. The syntax declaring this is:

	service svc_name svc_id hmin hmax [mem_lwm mem_hwm]

where

svc_name

is an identification for the service. It is used in constructing the names of data structures defining the service. Typically the prefix is the same or similar to the .svc filename - for example, the service defined by file fbtok.svc has service name fbtok (for remote file block token operations).

svc_id

is the unique number (or symbol representing this) of the ICS service.

hmin is the minimum number of free ICS server handles held in reserve for this service.

hmax

is the maximum number of free ICS server handles held in reserve for this service.

mem_lwm

is the number of free memory pages below which server-side flow-control (also called throttling) is activated. If not specified, this value is assigned a default at run-time.

mem_hwm

is the number of free memory pages above which server-side flow-control ceases if active. If not specified, this value is assigned a default at run-time.

Service ids are defined in header file <cluster/ics.h> which maps component services to ICS communication channels. Note that many CI/SSI services may be grouped together as sub-services sharing ICS communication channels and flow-controlled as a unit (see the discussion on communication channels and throttling in the high-level ICS section of the document).

The form of the operations definitions are described in the next section.

2.2. Operations

Each operation provided by an ICS-supported service must be fully defined. For each operation, parameters and their attributes must be declared. This both defines the interface with ICS and also dictates how ICS handles data transferred between nodes.

The syntax defining an operation is:

	operation op_name [op_attr] {

		param param_direction[:param_attr] param_utype param_name

		... repeated for each parameter...

where

op_name

is the name of the operation, for example: pvpop_fork;

op_attr is one of the following, defaulting to RPC:

RPC
the operation is a remote procedure call for which at least a returned status value will be awaited.

MSG
the operation is a simple message with no response expected

ASYC

the operation is an RPC but the caller requests that send and receive
sub-operations will be separately callable allowing non-blocking behavior.

SKIP

the operation is a dummy placeholder; it enables operation entry numbers to be reserved at a lower-level with no user-level interface corresponding to them. Note that the op_name field must be given.

In addition, there are the following modifiers (which can be concatenated to an operation attribute using the syntax op_attr:op_modifier):

NO_SIG_FORWARD requests that an RPC is not to be interrupted to forward signals on the server.

NO_BLOCK declares that a MSG operation may be called from interrupt level and that the client stub should not sleep awaiting system resources but should return with error (EAGAIN).

param_direction is one of the following:

IN for an input paramter;

OUT for an output parameter; or

INOUT for a parameter that is both input and output.

param_attr is a composite declaring the data attributes of a parameter, where this is a concatenation of the following separated conventionally by ":" (although any character is accepted):

OOL
declares the parameter to be out-of-line data.

XDR
declares the parameter is of complex datatype which has been defined as an XDR-able structure and is to be encoded/decoded using an xdr-routine (either generated by rpcgen or hand-written).

VAR
declares the parameter to be of variable length with the length determined by the following implicitly-generated parameter. That is, the caller of this operation supplies an additional argument specifying the length of the variable parameter. This length is generally the number of basic elements in the supplied array: this is the number of bytes for a character array but the number of elements in an array of more complex type. See below for an example of this.

PTR
declares that an input parameter being passed by reference is a a non-scalar type, which is not equivalent to intgeger type, and this pointer may be NULL. Parameters referenced by pointer must always be non-NULL - unless they have the PTR, VAR or OOL attribute.

FREE declares that an input parameter requires an addition call to free resources associated with it on the server-side after the remote server operation has been completed.

param_utype is the user type of this parameter, for example: struct vproc or cred_t.

param_name is the name of the parameter. Note that the name may be prefixed by "*"s indicating indirection; these are implicitly suffixed to param_utype.

In addition to these declared parameters, extra parameters are implicit for operations. If these were to be explicitly declared, these would occur as the first parameters to each operation and would take the following forms. For non-MSG operations returning a value:

param IN node_t node
param OUT int *rval

and for MSG operations not returning any value:

param IN node_t node

where:

node is the node number of the destination node on which the operation is targeted; and

rval is the return value of the non-message operation.

Variable length parameters declared with attribute VAR have an associated length parameter which is implicitly declared as the next parameter. Thus, a declaration of the form:

param IN:VAR char name[MAX_NAME_LEN]

will be handled by ICS as if the following were declared:

param IN:VAR char name[MAX_NAME_LEN]

param IN int name_len

This also illustrates the way that the maximum size of a variable length parameter is specified. The main purpose of this is to allocate buffer space into which data is decoded on the server.

icsgen pre-processes service definitions using the C pre-processor. Hence, service definitions can contain embedded C pre-processor directives, macro invocations, and, of course, C-style comments. `

2.3. Datatypes and Encoding

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- cli_encoderesp_ool_param_type() is a NO-OP. This was used to

support DMA-push/DMA-pull over the Servernet interconnect.

- There are no encode/decode routines for uio_t, mblk_t and mbuf_t,

as they are not supported in Linux. Instead, there are new routines

for encoding/decoding msghdr structures.

***********************************************************

Each parameter must have its data type fully defined. If the user data type, param_utype does not quote a C typedef - for example, it is a structure definition like struct procinfo - then a type will be generated by the replacement of whitespace by "_" characters forming, in this example, struct_procinfo.

In addition, indirection indicated through the use of a "*" appended to param_utype will result in _p being appended to the fully qualified parameter type, param_type. So, for example, struct procinfo * becomes struct_procinfo_p. Note that initially only a single level of indirection will be supported. This resulting type dictates the encoding an decoding steps which ICS will use to handle this parameter. For each type, the following routines (or macros) must be callable to ICS:

cli_encode_[ptr_][var_][ool_]param_type() to marshal an input parameter into an ICS request buffer on the client-side;
cli_decode_[ptr_][var_][ool_]param_type() to demarshal an output parameter from an ICS reply buffer on the client-side;
svr_encode_[ptr_][var_][ool_]param_type() to demarshal an input parameter from an ICS request buffer on the server-side;
svr_decode_[ptr_][var_][ool_]param_type() to marshal an output parameter into an ICS reply buffer on the server-side.

Here, the presence of ool_ in the routine name indicates that this parameter is being passed out-of-line and hence needs special handling. Likewise, var_ indicates a variable length parameter and ptr_ indicates a possibly null pointer parameter. OOL parameters also require:

cli_encoderesp_ool_param_type() to prepare for an OOL output parameter on the client-side prior to sending the request.

Encoding and decoding routines are provided by ICS for the following base datatypes for non-OOL data:

int (long)
char
char_array

and the following for OOL data:

data_t
uio_t
mblk_t

Marshalling for other commonly used kernel datatypes is defined by header file <cluster/ics_proto.h> and source file <cluster/util/nsc_ics.c>. Routines for all other datatypes must be supplied by individual CI/SSI component using ICS.

2.4. Service Invocation Macros

Calling macros are generated for each of the operations in a service. These generated macros invoke the corresponding ICS client stub routine. The form of the calling macro for RPC operations is:

	OP_NAME(node,rval,argument_list)

where:

OP_NAME

is the name of the operation (as it appears in the operation definition, without prefix and in upper-case);

node

is the node number identifying a remote node on which the operation is to be performed;

rval

is a variable to be assigned the return value of the remote operation;

argument_list

is the argument list (excluding destination node) for the operation.

For example, the service invocation macros for the remote PVP service will be generated so that invocation of RPVPOP_SETSID() results in a calls to the ICS client stub routine named cli_rpvpop_setsid().

Services including remote asynchronous operations also generate invocation macros for the send and receive operations. That is,

                OP_NAME_SEND(node,handle,argument_list)

                OP_NAME_RECEIVE(handle,rval,argument_list)

where:

handle

is a message handle returned by _SEND and supplied to _RECEIVE.

Note that handle is returned as the second parameter to the _SEND call and the same value must be supplied as the first parameter to the _RECEIVE call. To the caller, handle is of type cli_handle_t ** for the send call and cli_handle_t * for the _RECEIVE call. Note also that the _RECEIVE macro is not provided with the destination node and it returns rval.

Remote operation macros return an integer value indicating the success (or otherwise) of the attempt to perform the remote operation. If a non-zero value is returned, then neither rval nor any output parameter can be assumed valid.

For MSG operations with no value returned to the caller of ICS, rval is absent from the calling macro, which is thus in the form:

                OP_NAME_MSG(node,argument_list)

Note that operations defined as either MSG or ASYNC can be called in the normally synchronous RPC fashion since this form of client stub code is always generated and the server-side code behaves identically irrespective of the manner of client-side call. In particular, this enables the caller to wait for confirmation that a message operation has been performed remotely.

2.5. Summary Format of Generated Client-side Stubs

The basic form of the client-side data marshaling stub is:

int

	cli_<op_name>(

		node_t                  node,

		int                     *rval,

		<in_param_type1>        <in_param_name1>,

		<inout_param_type1>     <inout_param_name1>,

		<out_param_type1>       <out_param_name1>,

...

		<Local variables generated here>

		cli_handle_t *handle = icscli_handle_get(node, <service_num>, TRUE);

/*

		 * Marshal input arguments into transport buffer

*/

		cli_encode_<in_param_type1>(handle,<in_param_name1>,...);

		cli_encode_<inout_param_type1>(handle,<inout_param_name1>,...);

		/* ... as appropriate for all input parameters to be encoded ... */

		cli_stat = icscli_send(handle,...);     /* send */

		cli_stat = icscli_wait(...);            /* wait for reply */

/*

		 * Decode the result code for the remote operation

*/

		cli_decode_int_p(handle,rval);

/*

		 * Marshal output arguments from transport buffer

*/

		cli_decode_<inout_param_type1>(handle,<in_param_name1>,...);

		cli_decode_<out_param_type1>(handle,<inout_param_name1>,...);

		/* ... as appropriate for all output parameters to be decoded ... */

	out:

		icscli_handle_release(handle);

		return (cli_stat);

This pattern is complicated somewhat by additional generated code to handle variable length and OOL datatypes, and by asynchronous versions of the operation.

2.6. Summary Format of Generated Server-side Stubs

int

	svr_<op_name>(svr_handle_t *handle)

		int rval;

		<Operation arguments generated as local (stack) variables here>

/*

		 * Marshal input arguments from transport buffer onto stack

*/

		svr_decode_<in_param_type1>(handle,<in_param_name1>,...);

		svr_decode_<inout_param_type1>(handle,<inout_param_name1>,...);

		/* ... as appropriate for all input parameters to be encoded ... */

		icssvr_decode_done(handle);

		<op_name>(this_node, &rval, <argument_list>);

/*

		 * Encode the result of the operation

*/

		svr_encode_int(handle, rval);

/*

		 * Marshal output arguments from stack into transport buffer

*/

		svr_encode_<inout_param_type1>(handle,<in_param_name1>,...);

		svr_encode_<out_param_type1>(handle,<inout_param_name1>,...);

		/* ... as appropriate for all output parameters to be decoded ... */

/*

		 * If an error occured while encoding output params, reset the

		 * reply message and encode only the failure code.

*/

	out:

		if (svr_stat) {

			icssvr_rewind(handle);

			svr_encode_int(handle,svr_stat);

		return(0);	
	}

Stack variables are allocated for parameters in the following manner:

IN parameters passed by value have buffer space for the demarshaled parameter value reserved.
IN parameters pass by reference have a pointer variable allocated containing the address of a stack variable into which to demarshal the parameter value.
OUT parameters must be passed by reference. These have a pointer variable allocated containing the address of a stack variable into which the value is output and from which the returned value is encoded.
IN:OOL parameters must be passed by reference. A pointer variable with NULL value is allocated.
OUT:OOL parameters must be passed by double indirection. A pointer variable is allocated pointing to a NULL-valued pointer to the underlying OOL datatype; the output pointer being returned into this second pointer.

2.7. Generated Prototypes

For each service, header files are generated containing ANSI C function prototype declarations for all stubs and server functions. Inclusion of these header files in CI/SSI components using ICS ensures compatibility of the calling interface with ICS. The generated stubs also include these prototypes for self-consistency.

2.8. Generated Dispatch Table and Service Registration Routine

Additionally, a server dispatch table and a service registration routine is constructed for each service. This table enables the ICS server message handler to call the server function corresponding to the requested operation on the remote node. The form of the table is:

	struct {

		int (*<op_name1>)(svr_handle_t *hndl);

		int (*<op_name2>)(svr_handle_t *hndl);

		/* ... repeated for each operation */

		int (*<op_namen>)(svr_handle_t *hndl)

	} icssvr_<svc_name>_stubs = {

		svr_<op_name1>,

		svr_<op_name2>,

		/* ... repeated for each operation */

		svr_<op_namen>

The prototype for the service registration routine is:

	extern int <svc_name>_svc_init(void);

This routine must be called before ICS will service operation requests for the service.

2.9. Interface Examples

This section illustrates the interface of ICS with CI/SSI components (the service definer, provider and caller) in the form of a simple example. It shows how a service and its constituent operations are defined and what routines must be provided for ICS to call.

Suppose we have a service named foo having an RPC operation named bar. Service definition file foo.svc contains:

	service	foo NSC_FOO_SVC 1 2

	operation rfoo_bar {

		parameter IN:VAR char arg[MAX_ARGLEN]

		parameter OUT result_t *result

Functions must be provided to invoke and to service this operation in the following form:

	#include <ics_foo_protos_gen.h>	/* generated prototypes */

int

	foo_bar(node_t node, char *arg, result_t *result)

		if (node != this_node) {

/*

			 * Call ICS to perform this operation on remote node.

*/

			int rval;

			int cli_stat;

			cli_stat =  RFOO_BAR(node, &rval, arg, strlen(arg), result);

			if (cli_stat != 0)

				panic("ICS remote operation failed");

			return(rval);

		*result = do_bar(arg);

		result(0);

/*

	 * This routine is called to service a remote RFOO_BAR() request.

*/

int

	rfoo_bar(node_t node, int *rval, char_t arg, int arg_len, result_t *result)

		*rval = foo_bar(node, arg, arg_len, result);

		return(0);

If instead, this operation is declared asynchronous, viz:

	service	foo NSC_FOO_SVC 1 2

	operation rfoo_bar ASYNC {

		parameter IN:VAR char arg[MAX_ARGLEN]

		parameter OUT result_t *result

then the invoking function might take the following form:

	#include <ics_foo_protos_gen.h>		/* generated prototypes */

int

	foo_bar(node_t node, char *arg, result_t *result)

		if (node != this_node) {

/*

			 * Call ICS to perform this operation on remote node.

*/

			int rval;

			int cli_stat;

			cli_handle *hndl;

			cli _stat = RFOO_BAR_SEND(node, &hndl, arg, strlen(arg), result);

			if (cli_stat != 0)

				panic("ICS remote operation failed");

/*

			 * While this is happening, let's do something else...

*/

			do_be_do();

/*

			 * Now wait for the remote operation to complete.

*/

			cli_stat = RFOO_BAR_RECEIVE(hndl, &rval, arg, strlen(arg), result);

			if (cli_stat != 0)

				panic("ICS remote operation failed");

			return(rval);

		*result = do_bar(arg);

		return(0);

Note that the service routine is unchanged since it is unaware whether the client is making the request synchronously or asynchronously.

Finally, removing the output parameter from operation bar, we could redefine this operation as a message:

	service	foo NSC_FOO_SVC 1 2

	operation bar MSG {

		parameter IN:VAR char arg[MAX_ARGLEN]

then the client and server functions could take following form:

	#include <ics_foo_protos_gen.h>		/* generated prototypes */

int

	foo_bar(node_t node, char *arg)

		if (node != this_node) {

/*

			 * Call ICS to request this operation on remote node.

			 * But we don't wait around for any confirmation or reply.

*/

			int cli_stat;

			cli_stat = RFOO_BAR_MSG(node, arg, strlen(arg));

			if (cli_stat != 0)

				panic("ICS remote operation failed");

			return(0);

		do_bar(arg);

		return(0);

/*

	 * This routine is called to service a remote RFOO_BAR() request.

*/

int

	rfoo_bar(node_t node, char_t arg, int arg_len)

		(void) foo_bar(node, arg, arg_len);

		return(0);

2.10. `icsgen`

The icsgen build phase constructs the following components for defined operations:

1.

2.

3.

4.

5.

Macros to invoke the remote operations.

Prototype declarations for the remote operation routines.

Client interface stubs for remote operations.

Server interface stubs for remote operations.

Dispatch tables used internally by ICS.

Files are generated by icsgen containing the required components. Generated files use a file naming scheme in which the suffix _gen precedes the file extension, e.g. ics_vproc_protos_gen.h.

The build process to generate the required output text files is driven by make commands

using a master script called icsgen. icsgen is defined as follows:

2.10.1. Syntax

	icsgen [[-a] | [-c] [-m] [-p] [-s] [-t]] \

	       [-D def] [-U undef] [-n name] [-o outfile] svc_files

2.10.2. Description

icsgen derives output text selected by option switch from a given service definition file, svc_file. If no svc_file is specified, standard input is read. Output is written to a file bearing the default name of the requested transformation or to a specified outfile. Output text is selected by the following options:

-a generate all the following (this is the default);

-c generate client-side stubs;

-m generates invocation macros;

-p generates prototype declarations for stubs and server routines;

-s generates server-side stubs;

-t generates server dispatch table;

The input is read and passed through the C preprocessor before conducting the required transformation.

2.10.3. Options

-D preprocessor #define;

-U preprocessor #undefine;

-o specifies outfile filename.

2.10.4. Files

The default output files created by icsgen are as follows:

icscli_svc_gen.h - client-side stubs;
ics_svc_macros_gen.h - invocation macros;
ics_svc_protos_gen.h - prototype declarations;
icssvr_svc_gen.c - server-side stubs;
icssvr_svc_tables.c - server dispatch table and service registration routine.

where svc is the service name declared in the service definition file.

2.10.5. Examples

	icsgen -p pvp.svc as.svc

Generates vproc operations prototype file ics_vproc_protos_gen.h from service master files pvp.svc and as.svc.

	icsgen -c pvps.svc > icscli_pvps_gen.c

Generates client operation stubs for private virtual system operations defined in file pvps.svc and these are written to file icscli_pvps_gen.c.

2.11. Source and Build Environment

For a CI/SSI component using ICS to provide transport for a service, in addition to the service operations routines, the following files must be provided:

A .svc file defining the service, its operations and their parameters. For services employing the emulated nsc_rcall() interface, an RPCGEN .x file will be present and ICS service declarations may be embedded in this rather than being placed in a separate .svc file.
A file containing invocation of remote services via OP_NAME() macros. Typically, individual operations routines include local code and remote invocations.
A file containing the remote server routines op_name() which are called by the ICS server stubs routines. These routines may be included in the same file as the local opertaions routines, or they may be placed in a separate source file with name such as svc_name_server.c.
A source file to (pre-processor) include the generated ICS client and server stubs and the server tables and registration routine. This file must also include definition of encode/decode macros for special datatypes used by the service.

The nature of the encode/decode macros is discussed further below.

2.12. Defining Encoding and Decoding

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- cli_encoderesp_ool_macro is a NO-OP. This was used to support

DMA-push/DMA-pull over the Servernet interconnect.

***********************************************************

Generated ICS client and server stubs contain encode and decode calls to marshal parameters of specified datatype. Each invocation may be a call to a routine to perform the required marshaling or it may be a invocation of a macro to do this. However, the macro form is the primary interface and in some cases (when a parameter name needs conversion to its address) a macro will be required to effect a correct entry to a routine. Marshaling for a number of commonly used datatypes is defined as macros in CI/SSI header file <cluster/ics_proto.h>. Marshaling for other datatypes must be defined on a component by component basis.

Note that marshaling required for a particular datatype depends on the type and attributes of the parameter.

IN parameters require cli_encode_ and svr_decode_ macros.
OUT parameters require svr_encode_ and cli_decode_ macros.
INOUT parameters require both of the above.
IN:FREE parameters require an additional svr_free_ macro.
OUT:OOL parameters require an additional cli_encoderesp_ool_ macro.
IN:OOL parameters require an additional svr_free_ool_ macro.

Encode/decode macros for non-OOL and non-VAR types take 2 arguments the handle and the parameter name. OOL and VAR parameter macros additionally take a length parameter.

All OOL encode and decode macros can fail and return a non-zero status.

Whether passed by reference or value, non-OOL parameters will have space reserved for them on the server's stack and svr_decode_ macros use this for demarshaling input values and marshal output values.

OOL parameters require kernel memory to be allocated to receive the transferred data. It is the responsibility of the encode/decode macros to make this allocation and to ensure that the space is freed when appropriate after use. On the server-side, stack space is allocated for a pointer to the OOL datatype.

2.13. XDR-defined Datatypes

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- This section has been implemented, although not extensively tested.

(In other words, use at your own risk!).

- ICS marshalls all xdr parameters into iovecs pointed to by a msghdr

struct. This msghdr struct is then encoded/decoded using the ICS

routines:

icscli_llencode_ool_struct_msghdr
icscli_lldecode_ool_struct_msghdr

***********************************************************

Special handling is provided for complex datatypes involving trees defined in an xdr-able form (by .x files). Serialization (encoding) and deserialization (decoding) is performed through xdr routines generated by rpcgen. With ICS, this process involves construction of encoded mblk chains which are transferred out-of-line to the target node, decoded and freed. The ICS marshaling code arranges for one mblk chain to be dedicated to each parameter exchanged in this way. The ICS routines icscli_encode_ool_mblk_t(), icscli_decode_ool_mblk_t(), icssvr_encode_ool_mblk_t(), icssvr_decode_ool_mblk_t() are called to perform the transfer of the mblk chains.

To use XDR marshaling, encode/decode macros must call a special handling routine and quote the associated XDR marshaling routine to be used. On the encode side this routine is icscli_encode_xdr() or icssvr_encode_xdr() and on the decode side it is icscli_decode_xdr() or icssvr_decode_xdr(). Also, a generic free routine icssvr_free_xdr() is available for server-side use to free IN parameters after the remote operation has completed. These routines are called with the following parameters:

ICS handle associated with the call;
address of the data to be coded;
XDR routine to marshal the data type.

XDR routines are generated from a conventional .x file by rpcgen or they may be hand-coded. The data address given follows the usual conventions for XDR - in particular, a NULL address will cause memory to be allocated automatically on decode.

Note that using XDR involves a data copy for both encode and decode phases and therefore should be avoided for large data sizes if standard ICS out-of-line support can be exploited for greater efficiency.

2.14. `nsc_rcall()` Emulation

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- This section has been implemented, although not extensively tested.

(In other words, use at your own risk!).

- .x files are generated using rpcgen.

- The icsnsc_rcall macro function is obsoleted. Use nsc_rcall()

directly.

- Credentials are not yet supported.

- Data is marshalled/demarshalled in a msghdr struct.

- RPC client stub for nsc_rcall service is called:

cli_icsnsc_rcall()

- MSG client stub for nsc_rcall service is called:

cli_icsnsc_rcall_msg()

- Server stub for nsc_rcall service is called:

svr_icsnsc_rcall()

***********************************************************

As a compatibility and code migration path from the former nscrpc subsystem, the nsc_call() interface is emulated by ICS. This enables non-ICS style CI/SSI components - with RPC interfaces defined by .x file, compiled by rpcgen and employing the ONC-style RPC paradigm - to use ICS with minimal change to the RPC-calling code.

This emulation is conducted above the ICS data marshaling layer. Firstly, client calls to nsc_rcall() are redirected to an ICS routine, icsnsc_rcall(), by a #define of the form:

	#define tnc_rcall(h,proc, xargs, args, xresults, resultsp,flag) \

	    icstnc_rcall(h,proc,xargs,args,xresults,resultsp,flag,  \

		         cli_<svc>_rcall,cli_<svc>_rcall_noreply_msg)

The addtional parameters to icsnsc_rcall() indicate special ICS client stub routines. icsnsc_rcall() then:

marshals argument data into mblk's using the specified XDR routine,
obtains credentials information for the caller,
calls the appropriate ICS client stub according to whether a reply is expected,
demarshalls any results data from returning mblk's which are freed.

The special client stubs correspond to two special ICS operations (an RPC and a message) defined for the service as follows:

	operation <svc>_rcall {

        	param IN:PTR:FREE cred_t *credp    /* client credentials */

		param IN int             procnum   /* operatioin proc number */

		param IN:OOL mblk_t      *margsp   /* mblks encoded with args */

		param OUT:OOL mblk_t     **mrespp  /* mblks encoded with results */

	operation <svc>_rcall_noreply MSG {

		param IN:PTR:FREE cred_t  *credp   /* client credentials */

		param IN int              procnum  /* operation proc number */

		param IN:OOL mblk_t       *margsp  /* mblks encoded with args */

On the server-side, the corresponding ICS server stubs are called. These direct the requests to a common rcall dispatching routine, icsnsc_rpc_dispatch(), along with the server routine table entry to be dispatched. The mapping from server stub to this dispatcher is performed by further a macro of the form:

	extern int <svc>_0_nproc;

	extern nsc_rpccall_t nsc_rpc_<svc>_0[];

	#define <svc>_rcall(node,rval,redp,procnum,margsp,mrespp)    \

		ASSERT(node == this_node);                           \

		ASSERT(0 <= procnum && procnum < <svc>_0_nproc);     \

		*rval = icsnsc_rpc_dispatch(

			credp,&nsc_rpc_<svc>_0[procnum],margsp,mrespp)

/*

	 * Map the message-handling server stub to themessage version,

	 * but with a NULL result pointer.

*/

	#define <svc>_rcall_noreply(node,redp,procnum,margsp)     \

		ASSERT(node == this_node);                        \

		ASSERT(0 <= procnum && procnum < <svc>_0_nproc);  \

		(void) icsnsc_rpc_dispatch(

			credp,&nsc_rpc_<svc>_0[procnum],margsp,NULL)

icsnsc_rpc_dispatch() then proceeds to:

demarshal argument data from incoming mblk's by calling the corresponding XDR routine and then freeing the mblk's,
switch to the client's credentials,
call the corresponding server function,
revert to the server's credentials,
if results are to be returned, XDR these into outgoing mblk's,
free the results data if there's a corresponding XDR free function defined in the rcall table.

The above lines of macros must be inserted (prefixed by %'s) into the .x file. Additionally, the ICS service definition is placed in the .x file surrounded by pre-processor conditions which render these visible to the icsgen and not to rpcgen. Finally, other pre-processor directives are added to include ICS-generated stubs at the appropriate place. Thus, by suitable .x and makefile changes alone, a CI/SSI subsystem can be converted to use the ICS nsc_rcall() emulator with no further code change. Moreover, the conversion is achieved as a build option.

2.15. Verification Tests

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- Not implemented.

***********************************************************

Under the TNC_DEBUG build option, a series of tests of ICS operation is available through the tncsys() system call. At the command-line level these tests can be invoked by the command:

	node -I nodenum

This verifies a range of data type marshaling/de-marshaling, data transmission, and operation attributes (RPC, message, asynchronous) from the node of the calling process to a nominated second node.

Any test failure is logged to the console of the node detecting the error. Subsequent tests are continued, however. One or more test error being returned to the invoking process.

The source for these tests is to be found in file tnc/tnc_icstest.c and the ICS test operation are defined in file tnc/tnc_icstest.svc.

The test sequence is as follows:

1.

Test primitive types and basic RPC support.

a. Scalar integer values.
IN, OUT, and INOUT integer values are exchanged. This verifies the
basic operation of the ICS subsystem, the generated ICS client and
server stubs and the interface with the ICS subsystem: in particular,
routines icscli_handle_get(),
icscli_encode_int32(), icscli_decode_int32(),
icssvr_decode_int32(), icssvr_encode_int32(),
icscli_send(), and icscli_handle_release().

b. Variable-length integer arrays.
IN, OUT, and INOUTVAR integer arrays are exchanged. This verifies
the generation of ICS stubs for these types and of ICS interface
routines: icscli_encode_int32_array_t(),
icscli_decode_int32_array_t(),
icssvr_decode_int32_array_t(), and
icssvr_encode_int32_array_t().

c. Fixed and variable length character arrays.
IN, OUT, and INOUT fixed-length char arrays, and IN and INOUT VAR arrays are exchanged. This verifies the generation of ICS stubs for
these types and of ICS interface routines:
icscli_encode_char_array_t(), icscli_decode_char_array_t(), icssvr_decode_char_array_t(),icssvr_encode_char_array_t(). Also, the use of ICS marshaling macros is verified for the handling of fixed-length data stuctures (in this case a simple character array).

2. Out-of-line data.
IN, OUT, and INOUT fixed-length char arrays, and IN and INOUTVAR arrays are exchanged as out-of-line data. This verifies the generation of ICS stubs for these types and of ICS interface routines: icscli_encode_ool_data_t(),
icscli_encoderesp_ool_data_t(),
icscli_decode_ool_data_t(),
icssvr_decode_ool_data_t(), and icssvr_encode_ool_data_t(). In addition, ICS marshaling support routines to allocate and free memory are exercised.

3.
Special data marshaling.
This series of tests verifies special-purpose datatype handling provided at the ICS stub level.

a. PTR and FREE parameter attributes.
PTR parameters are potentially NULL pointers to IN data structures.
The FREE attribute requests that the server stub call a free routine for a
datatype after the remote procedure returns. This test verifies that
non-NULL and NULL pointers are correctly marshalled and that the
associated free routine is called.

b. uio and iovec marshaling.
This test verifies that a combined uio/iovec structure pair is correctly
marshaled to and from a remote node. Note that such data structures are
always INOUT.

c. XDR and credentials type marshaling.
This test exercises several capabilities:

� the use of XDR marshaling routines to transfer data structure in
the form of out-of-line encoded mblks;

� ICS credentials marshaling macros for the exchange of credentials
information (including NULL and system credentials) as ICS
in-line data; and

� the equivalence of credentials marshaled in either of these ways.

This tests generated stub code and the support routines for interfacing
with standard XDR marshaling routines and the production and
consumption of data-encoded mblks.

4.
Asynchronous capabilities (ASYNC operation attribute).
In this test, an ASYNC operation is first preformed synchronously as a conventional RPC and then _SEND and _RECEIVE portions are performed separately with a one second delay between.

This is a test of ICS stub generation only; the ICS subsystem does not distinguish between synchronous and split asynchronous operation.

5. Messaging (MSG operation attribute).
This tests the ability of ICS to perform a remote operation without returning data and with or without the client awaiting completion. There are 2 sub-tests. In both tests data is returned to the invoking node by means of a double-hop message: test node to second node and second node back to test node. In the first test, both hops are performed synchronously - the caller blocks until the remote operation has completed. In the second test, each hop is fully asynchronous with the testing node waiting one second to allow the return message to be delivered.

6.
Sundry tests.
Finally, a series of unusual parameter configurations is tested to verify ICS stub generation logic:

a. no input and no output parameters;

b. no output parameter; and

c. no input parameter.

7. Operation placeholder (SKIP operation attribute).
A SKIPped operation is included in the .svc file to verify that no stubs are generated for it but that the operation is correctly allocated an unused entry in the generated server table.

3. High-level ICS

This section of the document describes the high-level ICS interface (often referred to as
the ICS interface). The high-level ICS code is used for all inter-node communication.
High-level ICS routines are of several categories:

general purpose routines for initializing ICS, declaring nodes up and down, etc.
ICS client routines which are called by icsgen-generated client stubs.
server control code, which controls the creation and destruction of server handles and server daemons, and invokes the icsgen-generated server stubs.
ICS server routines which are called by icsgen-generated server stubs.

This section of the document contains several subsections. The first describes how ICS
deals with CI/SSI services and communication channels; the next subsection describes
the ICS server control code; the following two subsections describes some typical
scenarios for the invocation of high-level ICS client and server code; and the last
subsections contain interface descriptions for all high-level ICS routines.

3.1. CI/SSI Services and ICS Communication Channels

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- ICS channels are 2-way communication channels implemented over

TCP.

- ICS_NUM_CHANNELS, ICS_MIN_PRIO_CHAN, ICS_MAX_PRIO_CHAN,

ICS_REPLY_CHAN, ICS_REPLY_PRIO_CHAN have been renamed to

lower-case equivalents:

ics_num_channels

ics_min_prio_chan

ics_max_prio_chan

ics_reply_chan

ics_reply_prio_chan

- Throttling/Flow control not supported. The code is

#ifdef'ed ICS_THROTTLE.

***********************************************************

CI/SSI components utilize services when communicating with ICS. CI/SSI messages and replies are sent out over communication channels. There are ICS_NUM_CHANNELS separate one-way communication channels that are used to communicated between nodes - they are identified by channel number, which goes from zero to (ICS_NUM_CHANNELS-1). Messages and RPCs sent using a CI/SSI service get mapped to a specific communication channel. More than one CI/SSI service may be mapped to a specific communication channel (the mapping of CI/SSI services to communication channels is done in the file <cluster/ics/ics_conf.c>).

Services are grouped together on a channel because communication channels are the entities that are flow-controlled (also known as throttled). Throttling of communication channels is accomplished by stopping all communication over the channel. This is done when the client or server starts running low on memory/resources, and it is desirable for the ICS subsystem (and the CI/SSI components using the ICS susbsystem) to not exhaust all memory/resources. On the server node, the resource limits used to check to see if throttling should occur exist on a per-communication-channel basis (see ics_svr_register() for details), and hence, the grouping of servers on communication channels is relevant.

However, if communication channels could be arbitrarily throttled, deadlock scenarios could arise (e.g., certain RPCs may need to complete to free resources, yet the communication channel needed to complete the RPC stays throttled until the resources are freed). To prevent such deadlocks, the concepts of ICS priority and priority communication channels exist. The communication channels ICS_MIN_PRIO_CHAN..ICS_MAX_PRIO_CHAN are deemed high priority channels, and these channels are never throttled. No CI/SSI services are mapped into the priority channels in <cluster/ics/ics_conf.c>. To make a message or RPC use one of these high priority communication channels, the routine ics_setpriority() should be used to raise the priority level of the current thread. An ICS priority of zero does not use the high priority communication channels (the mapping of CI/SSI services to communication channels specified in <cluster/ics/ics_conf.c> is utilized); an ICS priority of one will use the communication channel ICS_MIN_PRIO_CHAN (regardless of the CI/SSI service); an ICS priority of two will use the communication channel ICS_MIN_PRIO_CHAN+1 (regardless of the CI/SSI service); etc.

Some of the most common deadlock scenarios relate to performing an RPC within an ICS server. An example of this is performing an RPC from node A to node B; the communication channel becomes throttle; the RPC service on node B performs an RPC back to node A; this RPC would free resources on node A, yet the RPC cannot complete because the communication channel is throttled. To prevent these type of deadlock scenarios, the ICS server control code automatically raises the ICS priority of any outgoing message or RPC to be one higher than the incoming RPC which invoked the outgoing message/RPC (if it was an incoming message rather than an RPC, then ICS priorities are not automatically raised).

Replies to non-high-priority RPCs take place over the ICS_REPLY_CHAN communication channel. Replies to all high priority RPCs take place over the ICS_REPLY_PRIO_CHAN communication channel. Neither of these reply channels is ever throttled. No CI/SSI services are mapped to these channels in <cluster/ics/ics_conf.c>.

3.2. ICS Server Control Code

While a major portion of ICS is devoted to the routines called by the icsgen-generated stubs, another major piece of ICS is the server control code, which controls server daemons and server handles.

For ICS server daemons, the server control code will attempt to keep icsdaemon_avail_lwm to icsdaemon_avail_hwm ICS server daemons in an available state (not processing messages or RPCs). If the number of server daemons is less than icsdaemon_avail_lwm, more daemons are created. If the number of server daemons is greater than icsdaemon_avail_hwm, server daemons are destroyed. Server daemons are used to handle both normal and high priority messages and RPCs. However, there also exists a special server daemon for each priority greater than one. If, when a high priority message or RPC arrives, there are no server daemons that are in an available state, the special daemon for the designated priority level is used.

For ICS handles, the server control code attempts to keep lwm_svc_handles to hwm_svc_handles in an icssvr_handle__recv() state (these limits are on a per-CI/SSI-service basis - see ics_svc_register() for details). If the number of handles performing icssvr_handle_recv() becomes less than lwm_svc_handles, then more server handles are created. If the number of handles performing icssvr_handle_recv() becomes greater than hwm_svc_handles, then server handles are destroyed. `

3.3. Typical ICS Client Usage Scenarios

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- ICS messages cannot be sent in interrupt context.

This could be fixed, if we set all ICS channel's sk->allocation

to be GFP_ATOMIC, instead of GFP_KERNEL.

***********************************************************

This section presents a few typical examples of how high-level ICS code interfaces are used on a client node.

The examples in this section describe the typical code sequences generated by icsgen for client stubs.

The first example is for the client to perform an RPC (Remote Procedure Call). Note that this example assumes the client code is in thread context.

1.

Obtain a client handle using icscli_handle_get().

2.

Set up all the necessary input arguments using the icscli_encode_*() routines.

3.

Set up all the necessary information for out-of-line output arguments using the icscli_encoderesp_ool_*() routines.

4.

Call icscli_send() to send the message to the server. icscli_wait_callback() is typically specified as the callback routine.

5.

6.

Wait for the reply to come back, typically using icscli_wait().

Obtain all the output arguments from the response message using the icscli_decode_*() routines.

7.

Release the client handle using icscli_handle_release().

The second example is for the client to send a message to the server. Note that this example can take place in interrupt context, as long as the restrictions described in the example are followed.

1.

Obtain a client handle using icscli_handle_get(). If the IPC is taking place in interrupt context, then make sure that sleep_flag is FALSE.

2.

Set up all the necessary input arguments using the icscli_encode_*() routines. If the IPC is taking place in interrupt context, then make sure that only icscli_encode_*() routines that are appropriate for interrupt context are used.

3.

Call icscli_send() to send the message to the server. The callback routine would typically be icscli_handle_release_callback().

The third example is for the client to perform a ``broadcast RPC'', where the ``send'' portion of the RPC is performed to a series of nodes, then work is done locally, then the ``wait for response'' portion of the RPC is performed for the same series of nodes. This example could be used to implement spanning tree algorithms. This example assumes that the code is running in thread context.

1.

Obtain the appropriate client handles using icscli_handle_get() (one handle for each node).

2.

Set up all the necessary input arguments using the icscli_encode_*() routines (the argument encoding must take place for each handle).

3.

Set up all the necessary information for out-of-line output arguments using the icscli_encoderesp_ool_*() routines (again, this must take place for each handle).

4.

Call icscli_send() for each handle to send the messages to all the servers. The callback routine specified would typically be icscli_wait_callback().

5.

6.

Perform any work that can be performed locally.

Wait for all the replies to arrive back on this node by calling icscli_wait() for each handle.

7.

Obtain all the output arguments using the icscli_decode_*() routines (the argument decoding must take place for each handle).

8.

Release the client handles by calling icscli_handle_release() for each handle.

3.4. Typical ICS Server Usage Scenarios

This section presents a few typical examples of how high-level ICS code interfaces are used on a server node.

The server control code is responsible for obtaining the appropriate number of server handles using icssvr_handle_get(), and then calling icssvr_recv() for each of these handles. If the service is one that is to be performed from thread context, then the server control code is also responsible for creating the appropriate number of ICS server daemons, so that any messages received by the server can be processed in a timely fashion.

The following example describes how the server end of an RPC (Remote Procedure Call) is handled. This example assumes that the callback routine for icssvr_recv() has arranged to wake up a server daemon and that the following server code is executed within the context of the server daemon. In this example, steps 1 though 4 are performed by the icsgen-generated server stubs. The remaining steps are the responsibility of the server control code.

	1.	Obtain all arguments from the message by using calls to the `icssvr_decode_*()` routines.
	2.	Call `icssvr_decode_done()` to indicate that all argument decoding has been completed.
	3. 4.	Perform whatever server work there is to perform. Put any output arguments for the RPC into the response message by using the `icssvr_encode_*()` routines.
	5.	Call `icssvr_reply()`. The callback routine specified should merely arrange to call `icssvr_recv()`.
	6.	Put the daemon back to sleep, waiting for another message, in whatever fashion the server chooses.

The following example describes how the server end of an IPC (i.e. message send) is handled. This example assumes that the callback routine for icssvr_recv() has arranged to wake up a server daemon and that the following server code is executed within the context of the server daemon. In this example, steps 1 though 4 are performed by the icsgen-generated server stubs. The remaining steps are the responsibility of the server control code.

1. Obtain all arguments from the message by using calls to the
icssvr_decode_*() routines.

Call icssvr_decode_done() to indicate that all argument decoding has been completed.

Immediately reply to the client by using the icssvr_reply() call. The callback routine could merely arrange to call icssvr_recv().

Internode Communication Subsystem (ICS) Design Specification Page 25

	4.		Perform whatever server work there is to perform.
	5.		Put the daemon back to sleep, waiting for another message, in whatever fashion the server chooses.

The following example describes how the server end of an IPC (i.e. message send) is handled, when the entire server is to run in interrupt context (e.g. to handle STREAMS aqueduct messages). This example assumes that the callback routine for icssvr_recv() has arranged to call the following sequence of code, which possibly runs in interrupt context.

1.

Obtain all arguments from the message by using calls to the icssvr_decode_*() routines. Ensure that only calls to icssvr_decode_*() routines that are safe in interrupt context are used.

2.

Call icssvr_decode_done() to indicate that all argument decoding has been completed.

3.

Immediately reply to the client by using the icssvr_reply() call. The callback routine could arrange to call icssvr_recv().

4.

Perform whatever server work there is to perform, again noting that the code may be running in interrupt context.

5.

Return to the caller.

3.5. ICS General Purpose Routines

The following routines specify the generic (not client-specific and not server-specific) CI/SSI ICS interfaces.

3.5.1. `ics_svc_register()`

	ics_svc_register(

		int    nsc_service,

		char   *nsc_service_name,

		int    num_ops,

		int    (*service_ops[])(svr_handle_t),

		char   *service_ops_names[],

		int    lwm_svc_handles,

		int    hwm_svc_handles,

		int    lwm_svc_flowpages,

		int    hwm_svc_flowpages)

This routine is called once per ICS service, to give ICS servers the information necessary to deal with ICS messages that arrive on the server.

nsc_service is a number that identifies the CI/SSI service being initialized.

nsc_service_name is a string which contains the name of the CI/SSI service.

num_ops is the number of server operations in the service_ops table.

service_ops is an array of num_ops procedure pointers, with each procedure pointer representing a server operation.

service_ops_names is an array of num_ops strings, with each string containing the name of the respective service_ops array entry.

lwm_svc_handles and hwm_svc_handles represent low and high water mark targets for number of available handles to maintain for this service by the server. ``Available'' handles are handles that are available for incoming messages via calls to icssvr_handle_recv().

lwm_svc_flowpages and hwm_svc_flowpages represent the low and high water marks for server throttling. The server will throttle this service (i.e. not accept any incoming messages for a client) when the free memory for the system is less than lwm_svc_flowpages pages and will keep the server throttled until the free memory for the system exceeds hwm_svc_flowpages pages. Note that the server does not pay attention to these low and high water marks if an incoming client message has an ICS priority greater than zero (see later discussion on ICS priority).

Note that ics_svc_register() for a service must be called prior to incoming or outgoing communications using this particular service. Typically, this implies that all CI/SSI services must be registered using ics_svc_register() prior to calling ics_init().

This routine must be called from thread context.

3.5.2. `ics_init()`

	void

	ics_init()

This routine is called once at system initialization time. No ics_*() routines (except for ics_svc_register()) can be called before this routine has been called.

This routine must be called from thread context.

3.5.3. `ics_nodeinfo_callback()`

int

	ics_nodeinfo_callback(

		void	nodeup_notification(

				node_t node),

		void	nodedown_notification(

				node_t node))

This routine sets up notification callbacks, so that ICS can inform someone that it has detected other nodes in the cluster coming up or going down. It must be performed once per node (and only once per node) early in system initialization. nodeup_notification may be NULL, but it certainly should be non-NULL on the CLMS master node (which is the only node of the cluster that is guaranteed to receive the notification).

This routine that is typically called by the CI/SSI CLMS component during system initialization.

When the ICS subsystem detects that a node has come up, it will perform the callback (if nodeup_notification is non-NULL), but it will not allow communication with the node until ics_seticsinfo() has been called.

When the ICS subsystem detects that a node has gone down, it will perform the callback (if nodedown_notification is non-NULL), but communications requests to the node that has gone down will not return failures until ics_nodedown() is called.

Even if the nodeup_notification callback routine is non-NULL, the actual callback is only guaranteed to occur on the node clms_master_node.

The nodedown_notification callback routine will occur on some node of the cluster when a node goes down (and possibly more than one node may call the callback routine).

This routine must be called from thread context.

The notification callback routines must be called from thread context.

3.5.4. `ics_geticsinfo()`

int

	ics_geticsinfo(

		node_t     node,

		icsinfo_t  *icsinfo_p)

This routine is called to fetch the information that the ICS needs to make a connection with the specified node from its local database.

This routine is typically called from CLMS to extract information that can be passed to other nodes in the system.

The type icsinfo_t will vary in definition from implementation to implementation. For some implementations it may simply be a node number, for TCP/IP-based implementations, it may be a character string with the IP address. The CLMS component is not expected to manipulate this data type in any way - merely to pass it from node to node, so the destination node can perform a call to ics_seticsinfo() prior to calling ics_nodeup() for a particular node.

After the call to ics_init(), it is expected that a node will have icsinfo_t about itself and about the CLMS master node.

After a call to ics_seticsinfo() for a node, the information is available at any future time via a call to ics_geticsinfo().

This routine must be called from thread context.

3.5.5. `ics_seticsinfo()`

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- We still explicitly call ics_seticsinfo() for our own node (see

ics_llinit) and the CLMS master node (see clms_find_master).

***********************************************************

int

	ics_geticsinfo(

		node_t     node,

		icsinfo_t  *icsinfo_p)

This routine is called to store the information that the ICS needs to make a connection with the specified node. A call to ics_nodeup() cannot be made for the specific node until a call to ics_seticsinfo() has been completed (exceptions to this are the fact that the ICS subsystem knows about the icsinfo_t for itself and for the clms_master_node node - no call to ics_seticsinfo() needs to be done for these nodes).

This routine is typically called from CLMS to set up communication with another node. Typically, CLMS will have obtained this information from some node in the system.

The type icsinfo_t will vary in definition from implementation to implementation. For some implementations it may simply be a node number, for TCP/IP-based implementations, it may be a character string with the IP address. The CLMS component is not expected to manipulate this data type in any way - merely to pass it from node to node, so the destination node can perform a call to ics_seticsinfo() prior to calling ics_nodeup() for a particular node.

After a call to ics_seticsinfo() for a node, the information is available at any future time via a call to ics_geticsinfo().

After a call to ics_nodedown() for a node, ics_seticsinfo() must be called again prior to a call to ics_nodeup().

After a call to ics_seticsinfo(), the CLMS service is allowed to communicate with the specified node, though other services are not permitted to do so.

This routine must be called from thread context.

3.5.6. `ics_nodeup()`

int

	ics_nodeup(

		node_t	node)

This routine is called (typically by the CI/SSI CLMS component) to inform the ICS that it should allow communication by CI/SSI services to the specific remote node. It may be that ICS previously detected the fact that the node is coming up(i.e. it may be that ICS performed the nodeup notification specified by ics_nodeinfo_callback(), but this is not necessary, since CLMS may be have been informed that a node has come up by other means).

This routine returns zero if communication was successfully established with the specific node.

This routine returns EAGAIN if it was unable to establish communication successfully with the node (for example, the node may have gone down very soon after the ics_nodeup() call).

This routine is also called with the node number of the current node during system initialization. This is a bookkeeping measure, and the return value must always be zero in this case.

Once ics_nodeup() is called, all CI/SSI services are allowed to communicate with the specified node.

This routine must be called from thread context.

3.5.7. `ics_nodedown()`

int

	ics_nodedown(

		node_t	node)

This routine is called (typically by the CI/SSI CLMS component) to inform the ICS that a certain node is now declared down. It may be that ICS previous detected the fact that the node is down (i.e. it may be that ICS performed the nodedown notification specified by ics_nodeinfo_callback()), but this is not necessary, since CLMS mayhave been informed that a node has come up by other means.

During the process of declaring a node down via ics_nodedown(), all RPCs that are outstanding to the specified node are caused to fail. Additionally, any servers that are processing messages from the down node are signalled with the SIGKILL signal.

This routine must be called from thread context.

3.5.8. `ics_getpriority()`

int

	ics_getpriority(void)

This routine is called to obtain the current ICS priority of the current thread.

The ICS priority for any process is initially zero. It can be set by using the ics_setpriority() routine. When a process with an ICS priority greater than zero performs an RPC or sends a message, the communication will take place using the high priority ICS communication channels rather than the normal priority communication channels (high priority communication channels are not subject to throttling).

This routine is used internally by ICS and is also usable by other CI/SSI components.

This routine may be called from interrupt context.

3.5.9. `ics_setpriority()`

	void

	ics_setpriority(

		int	ics_prio)

This routine is called to obtain the current ICS priority of the current thread.

This routine is used internally by ICS and is also usable by other CI/SSI components.

This routine may be called from interrupt context.

3.6. ICS Client-side Routines

The routines of the high-level ICS client interface (excluding data-marshalling routines) are specified in this section.

The routines described here are invoked from icsgen-generated stubs (except for icscli_wouldthrottle() and icscli_waitfornothrottle()).

3.6.1. `icscli_handle_get()`

	cli_handle_t *

	icscli_handle_get(

		node_t      node,

		int         service,

		boolean_t   sleep_flag)

This routine is used to obtain a handle for the specified CI/SSI service which can be used for all subsequent icscli_*() routines. The message to be sent using this handle is destined for node node.

If service is the CLMS service then ics_seticsinfo() must have been done for the specified node node (without a corresponding ics_nodedown()). If service is any other service, then ics_nodeup() must have been done (successfully) for the specified node node (without a corresponding ics_nodedown()).

If sleep_flag is TRUE, this routine may sleep and wait for resources to become available. Additionally, this routine checks to see if client throttling for the particular node and service is occuring; if throttling is occurring, then the routine will sleep until the throttling condition subsides. If sleep_flag is TRUE, this routine will always return a handle.

If sleep_flag is FALSE, this routine may be being called from interrupt level, and may not sleep. If the needed resources are not immediately available, the routine will return NULL. Additionally, if client throttling for the particular node and service is occuring, this routine will return NULL.

Once the a handle has been obtained, any input arguments can be encoded using the icscli_encode_*() routines, any out-of-line output arguments can be encoded using the icscli_encoderesp_ool_*() routines, and then a message can be sent using icscli_send().

3.6.2. `icscli_wouldthrottle()`

int

	icscli_wouldthrottle(

		node_t      node,

		int         service)

This routine is called to determine whether client ICS communication over the communication channel specified by node and service is currently throttled. It is called by CI/SSI components that wish to check whether communication with a node can occur at the present time, prior to consuming resources necessary for communication.

If communication is throttled, then the routine returns TRUE, otherwise it returns FALSE.

This call is made entirely voluntarily.

This routine may be called from interrupt context.

3.6.3. `icscli_waitfornothrottle()`

int

	icscli_waitfornothrottle(

		node_t      node,

		int         service)

This routine is called to wait until client ICS communication over the communication channel specified by node and service is not throttled. If is called by CI/SSI components that wish to wait until communication is not throttled prior to consuming resources necessary for communication (the routine is also called by icscli_handle_get()); CI/SSI components calling icscli_wouldthrottle() do so entirely voluntarily (this call is not required).

If communication is not throttled at the time the call is made, the routine returns immediately with a return value of FALSE. If communication is throttled, then the routine waits for the throttled condition to subside, and then returns with a value of TRUE.

This routine must be called from thread context (since it can sleep).

3.6.4. `icscli_send()`

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- Setting the ICS_NO_REPLY flag means we are sending an ICS message

rather than an ICS RPC.

***********************************************************

	void

	icscli_send(

		cli_handle_t  *handle,

		int           procnum,

		int           flag,

		void          (*callback)(

				cli_handle_t  *handle,

				long          callarg),

		long          callarg)

This routine is called to send a message to a server on another node.

At the time this routine is called, all arguments for the message (i.e. input arguments) must have been encoded using the icscli_encode_*() routines, and any out-of-line output arguments for the response must have been encoded using the icscli_encoderesp_ool_*() routines.

procnum is a ``magic cookie'' that is passed to the server node. It is typically used to identify the operation to be performed by the service, and could be (for example) an index into an operations vector associated with the service.

flag is a bit-mask constructed by an OR of the following values:

ICS_NO_REPLY

This indicates that the client does not intend to wait for completion of the requested operation (and therefore there are no returned output arguments).

The callback routine must be non-NULL. The callback routine (with argument callarg, which was passed in to icscli_send()) is called once the message has been sent (successfully or unsuccessfully) and the response has been received (if one was expected). When the callback routine is invoked, either the ch_status field of the client handle can be examined or the routine icscli_get_status() can be invoked to figure out whether there was a successful outcome (a status value of zero indicates success).

Once the callback routine has been called, it is permissible to decode any returned arguments using the icscli_decode_*() routines, and to release the handle by calling icscli_handle_release().

This routine may be called from interrupt context.

The callback routine may be called from interrupt context.

3.6.5. `icscli_wait_callback()`

	void

	icscli_wait_callback(

		cli_handle_t  *handle,

		int           callarg)

If the client ICS code wishes to call icscli_wait() to wait for a message to be sent (and a reply to be received, if necessary), the routine name icscli_wait_callback can be specified as the callback parameter to icscli_send(). This routine should not be directly called.

See icscli_wait() for further details.

3.6.6. `icscli_wait()`

	void

	icscli_wait(

		cli_handle_t  *handle,

		int           flag)

This is a routine that is available to be called after icscli_send() if icscli_wait_callback() is specified as the callback routine for icscli_send(). If the send was for an RPC (the flag parameter to icscli_send() did not include ICS_NO_REPLY), then this routine will sleep waiting for the reply from the server to be received (or until the server node goes down) and the out-of-line data that was included in the RPC is no longer going to be referenced by the ICS subsystem. If the send was for a message (the flag parameter to icscli_send() included ICS_NO_REPLY), then this routine will return once the out-of-line data that was included in the message is no longer going to be referenced by the ICS subsystem. Once icscli_wait() has returned, the out-of-line data that was part of the message/RPC can be freed.

flag is a bit-mask constructed by an OR of the following values:

ICS_NO_SIG_FORWARD

If this bit is on, then icscli_wait() will not be
interrupted by any POSIX signals. If this bit is off, then
icscli_wait() will sleep in a manner in which it will
intercept POSIX signals. Any POSIX signals which are
intercepted will be forwarded to to the server, and the
icscli_wait() will store the signal and then go back
to sleep waiting for the reply to arrive or for another POSIX
signal to be intercepted. All stored signals will be requeued
for processing after icscli_wait() returns. Note the
ICS_NO_SIG_FORWARD flag is ignored when
ICS_NO_REPLY was on the the call to
icscli_send().

This routine must be called from thread context.

This routine returns the status of the message or RPC. A zero value indicates success. An error value of ERMOTE indicates that there were communication problems with the remote node (typically due to hardware problems).

3.6.7. `icscli_get_status()`

	void

	icscli_get_status(

		cli_handle_t  *handle)

If the client does not wish to use icscli_wait_callback() and icscli_wait() after performing an icscli_send(), then the client code can call this routine to find the status of the message or RPC that was sent using icslci_send().

A zero return value indicates success. An return value value of ERMOTE indicates that there were communication problems with the remote node (typically due to hardware problems).

This routine must be called after the callback routine for icscli_send() has been called.

This routine may be called from interrupt context.

3.6.8. `icscli_handle_release()`

	void

	icscli_handle_release(

		cli_handle_t  *handle)

This routine is used to release the resources associated with a client handle.

Once this routine has been called, it is not permissible to perform any other icscli_*() routines using this handle.

It is not permissible to re-use the same handle for two consecutive calls to icscli_send(); handles must be released after every icscli_send().

This routine may be called from interrupt context.

3.6.9. `icscli_handle_release_callback()`

	void

	icscli_handle_release_callback(

		cli_handle_t  *handle

		long          callarg)

This routine is identical to icscli_handle_release() except that it has an extra callback argument, which is ignored.

This routine exists so that an equivalent to icscli_handle_release() exists that can be used as a callback routine to icscli_send().

3.7. ICS Client-side Argument Marshalling

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- There is no "pre-decoding" of out-of-line output arguments.

This was used to support DMA-push/DMA-pull over the Servernet

interconnect. Any icscli_encoderesp_*() routines are NO-OPs.

***********************************************************

The following routines are used by client code to set up the input arguments for any RPC or message (the icscli_encode_*() routines), and to obtain any RPC output arguments (the icscli_decode_*() and icscli_encoderesp() routines). Note that any of these routines can be macros.

These routines are invoked from icsgen-generated client stubs.

Note that it is not permissible to intermix calls to the icssvr_encode_*() and icssvr_decode_*() routines. The correct sequence is for the client to encode all the input arguments of the message, to pre-decode any out-of-line output arguments of the response, to send the message, and then to decode any output arguments from the response (if there is a response).

3.7.1. `icscli_??code_type()`

void

icscli_encode_int32(

	cli_handle_t *handle,

	int32 int32)

void

icscli_encode_int32_array_t(

	cli_handle_t *handle,

	int32 int32_array[],

	int nelem)

void

icscli_encode_char_array_t(

	cli_handle_t *handle,

	char char_array[],

	int nelem)

void

icscli_decode_int32(

	cli_handle_t *handle,

	int32 int32)

void

icscli_decode_int32_array_t(

	cli_handle_t *handle,

	int32 int32_array[],

	int nelem)

void

icscli_decode_char_array_t(

	cli_handle_t *handle,

	char char_array[],

	int nelem)

These routines are called to encode and decode the arguments of respective data types directly into (for encode) or from (for decode) the handle's argument in-line buffer.

Each call to icscli_encode_type() performed by the client side should be decoded by an identical call to icssvr_decode_type() on the server side. Similarly, each call to icssvr_encode_type() performed by the server side should be decoded by an identical call to icscli_decode_type() on the client side.

All arguments must be decoded in the same order they were encoded.

These routines may be called from interrupt context.

3.7.2. `icscli_??code*_data_t()`

int

	icscli_encode_ool_data_t(

		cli_handle_t *handle,

		u_char       *data,

		long         len,

		void         (*callback) {

				u_char   *data,

				long     callarg),

		long         callarg)

int

	icscli_encoderesp_ool_data_t(

		cli_handle_t *handle,

		u_char       *data,

		long         len)

int

	icscli_decode_ool_data_t(

		cli_handle_t *handle,

		u_char       *data,

		long         len)

These routines are called to encode and decode a block of untyped data into a message (for encode, i.e. input arguments) or from a response (for decode, i.e. output arguments). There is a limit of ICS_MAX_OOL_SEGMENTS calls to icscli_??code_ool_*() routines with non-NULL callback routines per message or reply.

The callback routine, if non-NULL, is called once the data is no longer required by the ICS low-level code. The callback routine (if provided) is typically a de-allocation routine of some sort.

Each call to icscli_encode_ool_data_t() performed by the client side should be decoded by an identical call to icssvr_decode_ool_data_t() on the server side. Similarly, Each call to icssvr_encode_ool_data_t() performed by the server side should be decoded by an identical call to icscli_decode_ool_data_t() on the client side.

Out-of-line ouput arguments require two calls a call to icscli_encoderesp_ool_data_t() before the call to icscli_send(), and a call to icscli_decode_ool_data_t() after the call to icscli_send(). The data is not actually guaranteed to be available until after the call to icscli_decode_ool_data_t(). The calls to both icscli_encoderesp_ool_data_t() and icscli_decode_ool_data_t() are required to support, in an efficient manner, low-level CI/SSI code that uses DMA-push, DMA-pull, and other techniques to transfer data. Calls to the icscli_encoderesp_ool_*() routines must be performed after all calls to icscli_encode_*() routines, and must be performed in the same order as the corresponding calls to the icscli_decode_ool_*() routines.

If these routines are successful, the return value is zero. If the return value is EREMOTE, then there was a communication problem with the server node (typically due to hardware problems).

These routines must be called from thread context.

The callback routine may be called from interrupt context.

3.7.3. `icscli_??code*_uio_t()`

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- There is no support for these routines, as their data types do not

exist in Linux.

- Use icscli_??code_ool_struct_msghdr() instead.

***********************************************************

int

	icscli_encode_ool_uio_t(

		cli_handle_t *handle,

		struct uio   *uio,

		long         len,

		u_char       (*callback) {

				struct uio  *uio))

int

	icscli_encoderesp_ool_uio_t(

		cli_handle_t *handle,

		struct uio   *uio)

int

	icscli_decode_ool_uio_t(

		cli_handle_t *handle,

		struct uio   *uio)

These routines are used to encode and decode the data that is represented by a UNIX-style uio structure into a message (for encode, i.e. input arguments) or from a response (for decode, i.e. output arguments). Each uio structure can represent one or more ``chunks'' of data. There is no limit on the number of chunks that can be represented by the uio structure.

The callback routine, if non-NULL, is called once the data is no longer required by the ICS low-level code. The callback routine (if provided) is typically a de-allocation routine of some sort.

There is a limit of one icscli_??code_ool_uio_t() call per message or reply.

A uio structure encoded using icscli_encode_ool_uio_t() must be decoded using icssvr_decode_ool_uio_t(). Similarly, a uio structure encoded using icssvr_encode_ool_uio_t() must be decoded using icscli_decode_ool_uio_t().

The uio structure used by icscli_encode_ool_uio_t() on the client side does not need to correspond exactly to the uio structure used by icssvr_decode_uio_t() on the server side (i.e. the server can use different data chunk sizes and a different number of data chunks when decoding than were used in the encoding). Similarly, the uio structure used by icssvr_encode_ool_uio_t() on the server side does not need to correspond exactly to the uio structure used by icscli_decode_uio_t() on the client side. All the data that was encoded, however, must be decoded.

uio structures, as output arguments, require two calls: a call to icscli_encoderesp_ool_uio_t() before the call to icscli_send(), and a call to icscli_decode_ool_uio_t() after the call to icscli_send(). The data is not actually guaranteed to be available, until after the call to icscli_decode_ool_uio_t(). The calls to both icscli_encoderesp_ool_uio_t() and icscli_decode_ool_uio_t() are required to support, in an efficient manner low-level CI/SSI code that uses DMA-push, DMA-pull, and other techniques to transfer data. Calls to the icscli_encoderesp_ool_*() routines must be performed after all calls to icscli_encode_*() routines, and must be performed in the same order as the corresponding calls to the icscli_decode_ool_*() routines.

If these routines are successful, the return value is zero. If the return value is EREMOTE, then there was a communication problem with the server node (typically due to hardware problems).

These routines must be called from thread context.

3.7.4. `icscli_??code_mbuf_t()`

int

	icscli_encode_ool_mbuf_t(

		cli_handle_t *handle,

		struct mbuf  *mbuf_p,

		booleant_t   sleep_flag)

int

	icscli_decode_ool_mbuf_t(

		cli_handle_t *handle,

		struct mbuf  **mbuf_p_p,

		booleant_t   sleep_flag)

This routine is used to encode and decode the data that is represented by a socket-style mbuf structure into a message. mbuf_p and mbuf_p_p can be a chain of mbuf structures, representing one or more "chunks" of data. There is no limit on the number of mbuf structures in the chain.

There is a limit of one icscli_encode_ool_mbuf_t() and icscli_decode_ool_mbuf_t() call per message.

A mbuf structure encoded using icscli_encode_ool_mbuf_t() must be decoded using icssvr_decode_ool_mbuf_t(). Similarly, an mbuf structure encoded using icssvr_encode_ool_mbuf_t() must be decoded using icscli_decode_ool_mbuf_t().

For M_DATA messages, the chain of mbuf structures used by icscli_encode_ool_mbuf_t() on the client side does not need to correspond exactly to the chain of mbuf structures used by icssvr_decode_mbuf_t() on the server side (i.e. the server can use different data chunk sizes and a different number of mbuf's in the chain when decoding than were used in the encoding). All the data that was encoded, however, must be decoded.

Similarly, for M_DATA messages, the chain of mbuf structures used by icssvr_encode_ool_mbuf_t() on the server side does not need to correspond exactly to the chain of mbuf structures used by icscli_decode_mbuf_t() on the server side (i.e. the client can use different data chunk sizes and a different number of mbuf's in the chain when decoding than were used in the encoding). All the data that was encoded, however, must be decoded.

mbuf's that are not M_DATA messages can be sent from the client to the server (but not vice versa). For these mbuf's , the server will allocate a fresh data block, to preserve alignment and to allow access via the db_datap pointer.

For all data structures except mbuf_t's and mblk_t's, the high-lievel ICS server code is responsible for allocation of all the out-of-line data areas prior to calling the icssvr_decode_ool_*() routines. For the decoding of mbuf_t's, the low-level ICS code performs the necessary data allocation using the standard socket data allocation techniques. The high-level server code frees the socket message, at the appropriate time, by calling m_free() or m_freem().

Unlike the other icscli_encode_ool_*() routines, there is no callback ofr icscli_encode_ool_mbuf_t(). Instead, the standard socket routines m_free() or m_freem() are called, once the data is no longer required by the ICS low-level code.

If sleep_flag is FALSE, then these routines may be called from interrupt context. In this case, if the routines need to sleep to allocate resources, then they return EAGAIN. If these routines are successful, the return value is zero. If the return value is EREMOTE, then there was a communication problem with the server node (typically due to hardware problems).

3.7.5. `icscli_??code_mblk_t()`

int

	icscli_encode_ool_mblk_t(

		cli_handle_t *handle,

		struct mblk  *mblk_p,

		booleant_t   sleep_flag)

int

	icscli_decode_ool_mblk_t(

		cli_handle_t *handle,

		struct mblk  **mblk_p_p,

		booleant_t   sleep_flag)

These routines are used to encode and decode the data, represented by a STREAMS-style mblk structure, into a message. mblk_p and mblk_p_p can be chains of mblk structures, representing one or more ``chunks'' of data. There is no limit on the number of mblk structures in the chain.

There is a limit of one icscli_encode_ool_mblk_t() and icscli_decode_ool_mblk_t() call per message or reply.

A mblk structure encoded using icscli_encode_ool_mblk_t() must be decoded using icssvr_decode_ool_mblk_t(). Similarly, a mblk structure encoded using icssvr_encode_ool_mblk_t() must be decoded using icscli_decode_ool_mblk_t().

For M_DATA messages, the chain of mblk structures used by icscli_encode_ool_mblk_t() on the client side does not need to correspond exactly to the chain of mblk structures used by icssvr_decode_mblk_t() on the server side (i.e. the server can use different data chunk sizes and a different number of mblk's in the chain when decoding than were used in the encoding). All the data that was encoded, however, must be decoded.

Similarly, for M_DATA messages, the chain of mblk structures used by icssvr_encode_ool_mblk_t() on the server side does not need to correspond exactly to the chain of mblk structures used by icscli_decode_mblk_t() on the server side (i.e. the client can use different data chunk sizes and a different number of mblk's in the chain when decoding than were used in the encoding). All the data that was encoded, however, must be decoded.

mblk's that are not M_DATA messages can be sent from the client to the server (but not vice versa). For these mblk's, the server will allocate a fresh data block, to preserve alignment and to allow access via the db_datap pointer.

Unlike the other icscli_encode_ool_*() routines, there is no callback for icscli_encode_ool_mblk_t(). Instead, the standard STREAMS routine freemsg() is called once the data is no longer required by the ICS low-level code.

3.8. ICS Server-side Routines

The routines of the server interface (excluding data-marshalling routines) are specified in this section.

These routines are invoked from icsgen-generated server stubs (except for icssvr_nodedown_svc_wait()).

3.8.1. `icssvr_handle_get()`

	svr_handle_t *

	icssvr_handle_get(

		int	service,

		int	sleep_flag)

This routine is used to obtain a handle for the specified CI/SSI service which can be used for all subsequent icssvr_*() operations.

If sleep_flag has a value of TRUE, then this routine is allowed to sleep to wait for resources/memory to become available. In this case, the routine always returns a handle.

If sleep_flag has a value of FALSE, then this routine is not allowed to sleep for any reason. If resources/memory that are required are not available without sleeping, then the routine NULL, and the ICS high-level code will call the routine at a later time for the same handle to attempt again to allocate the resources. If the resources/memory that are required are immediately available, then the routine returns a handle.

This routine may be called from interrupt context (if sleep_flag is FALSE).

3.8.2. `icssvr_recv()`

	void

	icssvr_recv(

		svr_handle_t   *handle,

		void           (*callback)(

					handle_t *handle,

					long     callarg),

		long           callarg)

This routine is called to arrange for the specified callback routine to be called when a message from a client for the specific service associated with this handle arrives on the server (i.e. the message associated with a icscli_send() for the service associated with the server handle arrives on the server). The callback routine must be specified (i.e. callback cannot be NULL).

After the callback routine has been called, arguments encoded by the sender may be decoded using the icssvr_decode_*() routines.

After arguments have been decoded, the server may optionally encode any output arguments into a response using the icssvr_encode_*() routines (node that if the client is not expecting a reply, then the server must not call any of the icssvr_encode_*() routines). The server must call icssvr_reply() after processing all arguments.

This routine may be called from interrupt context.

The callback routine may be called from interrupt context.

3.8.3. `icssvr_decode_done()`

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- This routine is optional. If it is not called, ICS resources are freed in icssvr_llreply().

***********************************************************

	void

	icssvr_decode_done(

		svr_handle_t   *handle)

This routine may be called on the server after all calls to the icssvr_decode_*() argument decoding routines are done.

Calling this routine is entirely optional, but calling it does give the ICS subsystem a chance to free resources associated with the incoming message.

This routine may be called from interrupt context.

3.8.4. `icssvr_reply()`

	void

	icssvr_reply(

		svr_handle_t  *handle,

		void          (*callback)(

					svr_handle_t  *handle,

					long          callarg),

		long          callarg)

This routine may be called by the server if the client is expecting a reply (i.e. the client performed an RPC) and if the client is not expecting a reply (i.e. the client sent a message).

If the client is expecting a reply (the client call to icscli_send() did not use the ICS_NO_REPLY flag), then the act that this routine is called means that all arguments in the incoming message have been decoded, that the server will no longer reference any decoded out-of-line parameters, and that all arguments for the outgoing reply have been encoded. The response can now be sent back to the client. The callback routine will be called once the response has been sent and the handle is not required any more.

If the client is not expecting a reply (the client call to icscli_send() did use the ICS_NO_REPLY flag), then the fact that this routine is called means that all arguments in the incoming message have been decoded and that the server code will no longer reference any decoded out-of-line parameters. The callback routine will be called once the handle is not required any more.

The callback routine must be non-NULL, and the argument to the callback routine is the callarg that was passed to icssvr_reply().

This routine may be called from interrupt context.

The callback routine may be called from interrupt context.

3.8.5. `icssvr_handle_release()`

	void

	icssvr_handle_release(

		svr_handle_t   *handle)

This routine releases a handle previously fetched by icssvr_handle_get().

It is not necessary (and it is inefficient) to call this routine for every message handled by the server; the server is permitted to call icssvr_recv() repeatedly with the same handle.

This routine may be called from interrupt context.

3.8.6. `icssvr_nodedown_svc_wait()`

	void

	icssvr_nodedown_svc_wait(

		node_t     node,

		int        service)

This routine is available to be called by CI/SSI components directly during nodedown handling to make sure that there are no server daemons that are processing handles for a specific service any longer.

This routine is required because ics_nodedown() returns once all client interactions with the specified node have been completed and the server daemons have been sent a SIGKILL signal. However, merely sending this signal is not enough to guarantee that the server daemons actually terminate processing for any handles.

This routine must be called after ics_nodedown() is invoked for the specific node, and prior to the invocation of ics_nodeup() for the node.

This routine does nothing to ensure that daemons operating for the specific service and node actually complete - the CI/SSI component that is calling this routine must do this itself. This routine will wait indefinitely if the service does not complete on its own.

As this routine is allowed to sleep, this routine must be called from thread context.

3.9. ICS Server-side Argument Marshalling

The following routines are used by the server code to obtain the input arguments for any RPC or message (the icssvr_decode_*() routines), and to set up any RPC output arguments (the icssvr_encode_*() routines). Note that any of these routines can be macros.

These routines are invoked from icsgen-generated server stubs.

Note that it is not permissible to intermix calls to the icssvr_encode_*() and icssvr_decode_*() routines. The correct sequence is for the server to receive a message, decode all the input arguments of the message, and then to encode any output arguments into the response.

3.9.1. `icssvr_??code_type()`

void

icssvr_decode_int32(

	svr_handle_t *handle,

	int32 *int32)

void

icssvr_decode_int32_array_t(

	svr_handle_t *handle,

	int32 long_array[],

	int nelem)

void

icssvr_decode_char_array_t(

	svr_handle_t *handle,

	char char_array[],

	long nelem)

void

icssvr_encode_int32(

	svr_handle_t *handle,

	int32 int32)

void

icssvr_encode_int32_array_t(

	svr_handle_t *handle,

	int32 int32_array[],

	int nelem)

void

icssvr_encode_char_array_t(

	svr_handle_t *handle,

	char char_array[],

	long nelem)

These routines are called to decode and encode the respective data types directly from the message buffer (for decode, i.e. the client's input arguments) or into the response buffer (for encode, i.e. the client's output arguments).

All arguments must be decoded in the same order they were encoded.

These routines may be called from interrupt context.

3.9.2. `icssvr_??code_data_t()`

int

icssvr_decode_ool_data_t(

	svr_handle_t *handle,

	u_char       *data,

	long         len)

int

icssvr_encode_ool_data_t(

	svr_handle_t  *handle,

	u_char        *data,

	long          len,

	void          (*callback)(

				u_char  *data,

				long    callarg),

	long          callarg)

These routines are called to decode and encode a block of untyped data from a message (for decode, i.e. a client's input arguments) or into a response (for encode, i.e. a client's output arguments).

The callback routine, if non-NULL, is called once the data is no longer required by the ICS low-level code. The callback routine (if provided) is typically a de-allocation routine of some sort.

There is a limit of ICS_MAX_OOL_SEGMENTS calls to icssvr_??code_ool_*() with non-NULL callback routines per message or reply.

Each call to icscli_encode_ool_data_t() performed by the client side should be decoded by an identical call to icssvr_decode_ool_data_t() on the server side. Similarly, each call to icssvr_encode_ool_data_t() performed by the server side should be decoded by an identical call to icscli_decode_ool_data_t() on the client side.

If these routines are successful, the return value is zero. If the return value EREMOTE, then there was a communication problem with the client node (typically due to hardware problems).

These routines must be called from thread context.

3.9.3. `icssvr_??code_uio_t()`

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- There is no support for these routines, as their data types do not

exist in Linux.

- Use icssvr_??code_ool_struct_msghdr() instead.

***********************************************************

int

icssvr_decode_ool_uio_t(

	svr_handle_t   *handle,

	struct uio     *uio)

int

icssvr_encode_ool_uio_t(

	svr_handle_t   *handle,

	struct uio     *uio,

	void           (*callback)(

				struct uio  *uio))

These routines are used to decode and encode the data that is represented by a UNIX-style uio structure from a message (for decode, i.e. client's input argument) or into a response (for encode, i.e. client's output argument). Each uio structure can represent one or more ``chunks'' of data. There is no limit on the number of chunks that can be represented by the uio structure.

The callback routine, if non-NULL, is called once the data is no longer required by the ICS low-level code. The callback routine (if provided) is typically a de-allocation routine of some sort.

There is a limit of one icssvr_*code_ool_uio_t() call per message or reply. A uio structure encoded using icscli_encode_ool_uio_t() must be decoded using icssvr_decode_ool_uio_t(). Similarly, a uio structure encoded using icssvr_encode_ool_uio_t() must be decoded using icscli_decode_ool_uio_t().

If the routines are successful, the return value is zero. If the return value is EREMOTE, then there was a communication problem with the client node (typically due to hardware problems).

These routines must be called from thread context.

3.9.4. `icssvr_??code_mbuf_t()`

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- There is no support for these routines, as their data types do not

exist in Linux.

- Use icssvr_??code_ool_struct_msghdr() instead.

***********************************************************

int

icssvr_decode_ool_mbuf_t(

	svr_handle_t   *handle,

	struct mbuf    **mbuf_p_p,

	boolean_t      sleep_flag)

int

icssvr_encode_ool_mbuf_t(

	svr_handle_t   *handle,

	struct mbuf    *mbuf_p,

	boolean_t      sleep_flag)

This routine is used to encode and decode the data that is represented by a socket-style mbuf structure. mbuf_p and mbuf_p_p can be a chain of mbuf structures, representing one or more ``chunks'' of data. There is no limit on the number of mbuf structures in the chain.

There is a limit of one icssvr_decode_ool_mbuf_t() and icssvr_encode_ool_mbuf_t() call per message.

A mbuf structure encoded using icscli_encode_ool_mbuf_t() must be decoded using icssvr_decode_ool_mbuf_t(). Similarly, a mbuf structure encoded using icssvr_encode_ool_mbuf_t() must be decoded using icscli_decode_ool_mbuf_t().

For all data structures except mbuf_t's and mblk_t's, the high-level ICS server code is responsible for allocation of all the out-of-line data areas prior to calling the icssvr_decode_ool_*() routines. For the decoding of mbuf_t's, the low-level ICS code performs the necessary data allocation using the standard socket data allocation techniques. The high-level server code frees the socket message, at the appropriate time, by calling m_free() or m_freem().

For M_DATA messages, the chain of mbuf structures used by icscli_encode_ool_mbuf_t() on the client side does not need to correspond exactly to the chain of mbuf structures used by icssvr_decode_mbuf_t() on the server side (i.e. the server can use different data chunk sizes and a different number of mbuf's in the chain when decoding than were used in the encoding). All the data that was encoded, however, must be decoded.

Similarly, for M_DATA messages, the chain of mbuf structures used by icssvr_encode_ool_mbuf_t() on the server side does not need to correspond exactly to the chain of mbuf structures used by

Internode Communication Subsystem (ICS) Design Specification Page 48

icscli_decode_mbuf_t() on the server side (i.e. the client can use different data chunk sizes and a different number of mbuf's in the chain when decoding than were used in the encoding). All the data that was encoded, however, must be decoded.

mbuf's that are not M_DATA messages can be sent from the client to the server (but not vice versa). For these mbuf's, the server will allocate a fresh data block, to preserve alignment and to allow access via the db_datap pointer.

Unlike the other icscli_encode_ool_*() routines, there is no callback for icscli_encode_ool_mbuf_t(). Instead, the standard socket routines m_free() or m_freem() are called, once the data is no longer required by the ICS low-level code.

If sleep_flag is FALSE, then these routines may be called from interrupt context. In this case, if the routines need to sleep to allocate resources, then they return FALSE. If these routines are successful, the return value is zero. If the return value is EREMOTE, then there was a communication problem with the client node (typically due to hardware problems).

3.9.5. `icssvr_??code_mblk_t()`

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- There is no support for these routines, as their data types do not

exist in Linux.

- Use icssvr_??code_ool_struct_msghdr() instead.

***********************************************************

int

icssvr_decode_ool_mblk_t(

	svr_handle_t   *handle,

	struct mblk    **mblk_p_p,

	boolean_t      sleep_flag)

int

icssvr_encode_ool_mblk_t(

	svr_handle_t   *handle,

	struct mblk    *mblk_p,

	boolean_t      sleep_flag)

This routine is used to decode the data that is represented by a STREAMS-style mblk structure. mblk_p and mblk_p_p can be a chain of mblk structures, representing one or more ``chunks'' of data. There is no limit on the number of mblk structures in the chain.

There is a limit of one icssvr_decode_ool_mblk_t() and icssvr_encode_ool_mblk_t() call per message or reply.

For all out-of-line data structures except mblk_t's and mbuf_t's, the high-level ICS server code is responsible for allocation of all the out-of-line data areas prior to calling the icssvr_decode_ool_*() routines. For the decoding of mblk_t's, the low-level ICS code performs the necessary data allocation using the standard STREAMS data allocation techniques. The high-level server code frees the STREAMS message, at the appropriate time, by calling freemsg().

For M_DATA messages, the chain of mblk structures used by icscli_encode_ool_mblk_t() on the client side does not need to correspond exactly to the chain of mblk structures used by icssvr_decode_mblk_t() on the server side (i.e. the server can use different data chunk sizes and a different number of mblk's in the chain when decoding than were used in the encoding). All the data that was encoded, however, must be decoded.

4. Low-level ICS

Low-level ICS is the transport-specific portion of ICS - i.e., it is expected that the code implementing the low-level ICS will vary from transport to transport.

Low-level ICS routines are invoked only from high-level ICS code - these routines are never called from icsgen-generated stubs or from other CI/SSI components.

Communication between high-level ICS and low-level ICS utilizes the concept of handles. There are two types of handles - client handles (type cli_handle_t), used by client-side ICS code; and server handles (type svr_handle_t), used by server-side ICS code. The handles are divided into two sections - the high-level ICS portion and the low-level ICS portion. The low-level ICS portion of the client handle is declared as ch_llhandle of type cli_llhandle_t in the cli_handle_t structure. The low-level ICS portion of the server handle is declared as sh_llhandle of type svr_llhandle_t in the svr_handle_t structure. The low-level ICS portion of the handle is entirely opaque to high-level ICS - it is merely allocated as part of the handle in icscli_handle_get()/icssvr_handle_get(). Some of the fields in the high-level portion of the handle are used for communication between high-level ICS and low-level ICS (see icscli_llsend(), icscli_sendup_reply(), icssvr_sendup_msg(), and icssvr_llreply() for details) - any remaining fields in the high-level ICS portion of the handle are opaque to the low-level ICS.

Low-level ICS is free to use the low-level ICS portion of the handle in any way it
chooses. Typical uses could include: tracking the encoding/decoding progress for input
arguments and output arguments, storage for the entire in-line buffer, locks required for
handle manipulation, etc.

The following sections describe how communications channels relate to low-level ICS,
provide typical scenarios for the interaction between high-level ICS and low-level ICS,
and then details the interface between high-level ICS and low-level ICS.

4.1. Low-level ICS and Communication Channels

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- ICS channels are 2-way communication channels implemented over

TCP.

- ICS_NUM_CHANNELS, ICS_MIN_PRIO_CHAN, ICS_MAX_PRIO_CHAN,
ICS_REPLY_CHAN, ICS_REPLY_PRIO_CHAN have been renamed to
lower-case equivalents:
ics_num_channels
ics_min_prio_chan
ics_max_prio_chan
ics_reply_chan
ics_reply_prio_chan
- Throttling/Flow control not supported. The code is
#ifdef'ed ICS_THROTTLE.

***********************************************************

The interfaces between high-level ICS and low-level ICS tell low-level ICS which
communication channel to use to send messages and replies. There are
ICS_NUM_CHANNELS separate one-way communication channels that are used to
communicate between nodes - they are identified by channel number, which goes from
zero to (ICS_NUM_CHANNELS-1).

These channels have special uses from the perspective of high-level ICS, but low-level
ICS is pretty much unaware of any such uses (see possible exceptions below).

Generally speaking, ICS does not send messages/replies to another node until
ics_nodeup() (and thus, ics_llnodeup()) has been called for that node.
Thus low-level ICS may use ics_llnodeup() as an opportunity to perform setup
for all channels. However, communication over the priority channels
ICS_MIN_PRIO_CHAN..ICS_MAX_PRIO_CHAN and the priority reply channel
ICS_REPLY_PRIO_CHAN may take place after the call to
ics_seticsinfo() (and thus, ics_llseticsinfo()) (for bootstrap
purposes). Low-level ICS needs to allow for communication over these channels under
the above circumstances.

Communication channels for ICS may be flow-controlled, or throttled. High-level ICS
decides when a communication channel is throttled, and when it is un-throttled; low-level
ICS implements the actual throttling (i.e., high level ICS code is responsible for
implementing the policy for throttling, and low-level ICS code is responsible for implementing the mechanism for throttling). The throttling decision is based upon resource usage on the server node. When a communication channel is throttled, the server is accepting no more messages/RPCs. See the interface descriptions for icscli_llwaitfornothrottle(), icscli_llwouldthrottle(), icssvr_find_recv_handle(), and icssvr_llhandles_present() for details about how throttling decisions are communicated between high-level and low-level ICS code.

Low-level ICS code may have additional throttling policy for the communication channels, but this should be implemented with care to not cause CI/SSI components to deadlock waiting for RPCs to complete (specifically, care should be taken with throttling decisions for the high-priority communication channels).

Additionally, there may be some hysteresis between the time a server declares a channel throttled and the throttling is visible on the client - as long as messages/RPCs are not lost, this is an acceptable scenario.

Low-level ICS code is entirely unaware of the high-level ICS concept of CI/SSI services and ICS priorities - it deals only with communication channels.

4.2. Low-level ICS Typical Client Scenarios

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- Throttling/Flow control not supported. The code is

#ifdef'ed ICS_THROTTLE.

- There is no encoding of responses for output arguments. This was

used to support DMA-push/DMA-pull over the Servernet interconnect.

Any icscli_encoderesp_*() routines are NO-OPs.

***********************************************************

This section presents how the high-level ICS client code and low-level ICS client code typically interact.

The following example would apply to any of the high-level ICS client scenarios that were described in the high-level ICS section of the document.

1.

When the client handle is obtained using icscli_handle_get(), the code in icscli_handle_get() will call either icscli_llwouldthrottle() or icscli_llwaitfornothrottle() to handle any communication throttling that may be occurring. Then the routine icscli_llhandle_init() is called to initialize the handle.

2.

Each call for encoding input arguments using the icscli_encode_*() routines will result in a call to the corresponding icscli_llencode_*() routine.

3.

Each call for encoding responses for output arguments using the icscli_encoderesp_*() routines will result in a call to the corresponding icscli_llencoderesp_*() routine.

4.

When the client calls icscli_send(), the routine icscli_llsend() will be called after the specified handle set-up has been performed. The low-level ICS code calls icscli_sendup_reply() when appropriate (when a reply arrives back for RPCs, when the message has been sent for messages). The routine icscli_sendup_reply() is responsible for performing the callback routine for icscli_send(). If this is an RPC, the low-level ICS code will call icscli_find_transid_handle() to determine which handle a reply corresponds to.

5.

If the original call was an RPC, then there are arguments to decode. Each call for decoding arguments using the icscli_decode_*() routines will result in a call to the corresponding icscli_lldecode_*() routine.

6.

When the client handle is released via a call to icscli_handle_release(), the routine icscli_llhandle_deinit() will be called.

4.3. Low-level ICS Typical Server Scenarios

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- The handle allocation routine icssvr_handle_get() calls

icssvr_llhandle_init().

***********************************************************

This section presents how the high-level ICS server code and low-level ICS server code typically interact.

The following example would apply to any of the high-level ICS server scenarios that were described in the high-level ICS section of the document.

1. The handle allocation routine icssvr_handle_get() calls
icssvr_handle_init().

2. The high-level ICS code calls icssvr_llhandle_init_for_recv() (typically from icssvr_recv()). If icssvr_llhandle_init() returns NULL, then icssvr_llhandle_init() will be called again later, from thread context (where it cannot return NULL).

3.

The next step varies, depending upon whether the low-level ICS code needs a handle to receive a message or not. If the low-level ICS code does need a handle to receive a message, it performs a call to icssvr_handle_recv() to obtain such a handle. If the low-level ICS code does not need a handle to receive a message, it performs a call to icssvr_handle_recv() after a message arrives. In either case, there will be an incoming message corresponding to a handle (note that this scenario does not cover the case where icssvr_handle_recv() returns NULL - see the interface descriptions for icssvr_handle_recv() and icssvr_llhandles_present() for details on this).

4.

The low-level ICS code calls icssvr_sendup_msg() when a message arrives on the node and the input arguments are ready to be decoded.

5.

Each call for decoding arguments using the icssvr_decode_*() routines will result in a call to the corresponding icssvr_lldecode_*() routine.

6.

The routine icssvr_lldecode_done() will typically be called from icssvr_decode_done().

7.

If the original call was an RPC, then there are output arguments to encode. Each call for encoding output arguments using the icssvr_encode_*() routines will result in a call to the corresponding icssvr_llencode_*() routine.

8.

The routine icssvr_llreply() will be called from icssvr_reply(). When the low-level ICS code is done with the handle, it will call icssvr_sendup_replydone(). icssvr_sendup_replydone() is responsible for ensuring that the callback routine for icssvr_reply() is called.

9.

If the high-level ICS code determines that the handle is to be re-used, go back to the second step. If the high-level ICS code determines that the handle is no longer required, then icssvr_handle_release() will call icssvr_llhandle_deinit().

4.4. Low-level ICS General Purpose Upcall Routines

The routines of the high-level ICS general purpose code that are called exclusively by the low-level ICS code (so-called upcall routines) are specified in this section.

4.4.1. `ics_nodedown_notification()`

	void

	ics_nodedown_notification(

		node_t  node)

This high-level ICS routine is called when the low-level ICS has detected that this node is having communication problems with the specified node. These communication problems are serious enough that the node is probably down.

It is expected that the low-level ICS code will invoke this routine only after a call to ics_llseticsinfo() (that does not have a corresonding call to ics_llnodedown()). Additionally, ics_nodedown_notification() should only be called once for a specific node by the low-level ICS code (needless to say, if the node is declared down via ics_llnodedown(), another call to ics_nodedown_notification() can be made after then next call to ics_llseticsinfo()).

Note that if the low-level ICS detects communication problems, it is expected to call ics_nodedown_notification(), but the routines icscli_sendup_reply() and icssvr_sendup_replydone() should not be called until ics_llnodedown() is invoked.

ics_nodedown_notification() is responsible for making the sure that the nodedown notification routine registered via the call to ics_nodeinfo_callback() is called.

High-level ICS code expects that ics_nodedown_notification() will be called on some node of the cluster when a node goes down - it could be any node. ics_nodedown_notification() may be called on multiple nodes of the cluster when a node goes down.

This routine must be called from thread context.

4.4.2. `ics_nodeup_notification()`

	void

	ics_nodeup_notification(

		node_t  node)

This high-level ICS routine is called when the low-level ICS has detected that a the node node has joined the cluster.

It is expected that the low-level ICS code will invoke this routine only if the node is not in a nodeup state (i.e. there should not have been a call to ics_llseticsinfo() for this node without a matching call to ics_llnodedown()). Additionally, ics_nodeup_notification() should only be called once for a specific node by the low-level ICS code (needless to say, if the node is declared up via ics_llseticsinfo() and then down via ics_llnodedown(), another call to ics_nodeup_notification() can be made).

ics_nodeup_notification() is responsible for making the sure that the nodeup notification routine registered via the call to ics_nodeinfo_callback() is called.

High-level ICS code expects that ics_nodeup_notification() will be called on the node clms_master_node when a node joins the cluster. The low-level ICS code is responsible for ensuring that the CLMS master node detects the node joining the cluster. Calls to ics_nodeup_notification() on other nodes of the cluster are optional.

This routine must be called from thread context.

4.5. Low-level ICS General Purpose Routines

The general purpose (not client-specific and not server-specific) low-level ICS interfaces are specified in this section. These routines are called exclusively by high-level ICS routines. `

4.5.1. `ics_llinit()`

	void

	ics_llinit()

This routine is called once at system initialization time. No ics_ll*() routines can be called before this routine has been called.

This routine must be called from thread context.

4.5.2. `ics_llgeticsinfo()`

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- After the call to ics_llinit(), it is expected that a node will

have icsinfo_t about itself.

- After the call to clms_find_master(), it is expected that a node

will have icsinfo_t about the CLMS master node.

***********************************************************

int

	ics_llgeticsinfo(

		node_t     node,

		icsinfo_t  *icsinfo_p)

This routine is called to fetch the information that the ICS needs to make a connection with the specified node from its local database.

This routine is called from the high-level ICS routine ics_geticsinfo().

The type icsinfo_t will vary in definition from implementation to implementation. For some implementations it may simply be a node number; for TCP/IP-based implementations, it may be a character string with the IP address. Other CI/SSI components are not expected to manipulate this data type in any way - merely to pass it from node to node, so the destination node can perform a call to ics_llseticsinfo() prior to calling ics_llnodeup() for a particular node.

After the call to ics_llinit(), it is expected that a node will have icsinfo_t about itself and about the CLMS master node.

After a call to ics_llseticsinfo() for a node, the information is available at any future time via a call to ics_llgeticsinfo().

This routine must be called from thread context.

4.5.3. `ics_llseticsinfo()`

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- A call to ics_llnodeup() cannot be made for a specific node until

a call to ics_llseticsinfo() has been completed. This applies also

to the current node (see ics_llinit) and the CLMS master node

(see clms_find_master).

***********************************************************

int

	ics_llseticsinfo(

		node_t     node,

		icsinfo_t  *icsinfo_p)

This routine is called to store the information that the ICS needs to make a connection with the specified node. A call to ics_llnodeup() cannot be made for the specific node until a call to ics_llseticsinfo() has been completed (exceptions to this are the fact that the low-level ICS subsystem knows about the icsinfo_t for itself and for the node clms_master_node - no call to ics_seticsinfo() needs to be done for these nodes).

This routine is called from the high-level ICS routine ics_seticsinfo().

After a call to ics_seticsinfo() for a node, the information is available at any future time via a call to ics_geticsinfo().

If a node is declared down via ics_llnodedown(), a fresh call to ics_llseticsinfo()must be made prior to declaring the node up via ics_llnodeup().

Once ics_llseticsinfo() is called, communication to the node node can occur over the high priority communication channels (this feature is typically used by the CLMS service only).

This routine must be called from thread context.

4.5.4. `ics_llnodeup()`

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- For all nodes, even the current node and the CLMS master node,

a call to ics_llseticsinfo() must be made prior to the call to

ics_llnodeup() (and after the last call to ics_llnodedown() for

that node).

***********************************************************

int

	ics_llnodeup(

		node_t     node)

This routine is called (by the high-level ICS routine ics_nodeup()) to inform the low-level ICS code that it should allow communication by CI/SSI services to the specific remote node. It may be that ICS previous detected the fact that the node is coming up (i.e. it may be that ICS performed the nodeup notification routine ics_nodeup_notification()), but this is not necessary.

This routine returns zero if communication was successfully established with the specified node.

This routine returns EAGAIN if it was unable to establish communication successfully with the node (for example, the node may have gone down very soon after the ics_llnodeup() call).

This routine is also called with the node number of the current node during system initialization. This is a bookkeeping measure, and the return value must always be zero in this case.

For all nodes except the current node and the node clms_master_node, a call to ics_llseticsinfo() must be made prior to the call to ics_llnodeup() (and after the last call to ics_llnodedown() for this node).

This routine must be called from thread context.

4.5.5. `ics_llnodedown()`

int

	ics_llnodedown(

		node_t     node)

This routine is called (by the high-level ICS routine ics_nodedown()) to inform the ICS that a certain node is now declared down. It may be that low-level ICS previously detected the fact that the node is down (i.e. it may be that ICS performed the nodedown notification routine ics_nodedown_notification()), but this is not necessary.

During the process of declaring a node down via ics_llnodedown(), all RPCs that are outstanding to the specified node are caused to fail. Additionally, any servers that are processing messages from the down node are signalled with the SIGKILL signal.

This routine must be called from thread context.

4.6. Low-level ICS Client-side Upcall Routines

The routines of the high-level ICS client code that are called exclusively by the low-level ICS client code (so-called upcall routines) are specified in this section. `

4.6.1. `icscli_find_transid_handle()`

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- In implementation, the transid is simply the memory address of

the handle.

***********************************************************

	cli_handle_t *

	icscli_find_transid_handle(

		long	transid)

This routine is called by the low-level ICS code when a reply arrives from the server and the low-level ICS client code needs to find the client handle that sent the RPC which is being replied to.

transid is the transaction ID that identifies the client handle. The transaction ID is the client handle field ch_transid, which was filled in by the high-level ICS code prior to the call to icscli_llsend(). The low-level ICS code is responsible for making sure that the transid follows the message to the server and is then passed back with the reply.

This routine expects a valid transaction ID - the result is undetermined when a transaction ID with no client handle is passed to this routine.

This routine may be called from interrupt context.

4.6.2. `icscli_sendup_reply()`

	void

	icscli_sendup_reply(

		cli_handle_t  *transid)

This routine is called by the low-level ICS code when one of the following conditions has occurred:

a.

the node to which the message/RPC was sent is down (or the node doesn't exist); in this case the ch_status field of handle is EREMOTE; or

b.

the call to icscli_llsend() was made with a non-zero value in the ch_status field of the handle; in this case the ch_status field of handle will still contain this non-zero value; or

c.

the caller of icscli_llsend() specified an RPC and the reply to the RPC has arrived and any out-of-line data that may have been part of the reply is is ready to be decoded using the icscli_decode_*() routines and the out-of-line data that was part of the original message will no longer be referenced by the low-level ICS code (and handle can be deinitialized); in this case the ch_status field of handle will have a value of zero; or

d.

the caller of icscli_llsend() specified a message (not an RPC) and the out-of-line data that was part of the original message will no longer be referenced by the low-level ICS code (and handle can be deinitialized); in this case the the ch_status field of handle will have a value of zero.

Once icscli_sendup_reply() is called, the high-level ICS code is free to call the callback routine specified by icscli_send().

This routine may be called from interrupt context.

4.7. Low-level ICS Client-side Routines

The routines of the low-level client interface (excluding data-marshalling routines) are specified in this section. These routines are called exclusively by high-level ICS routines.

4.7.1. `icscli_llhandle_init()`

	void

	icscli_llhandle_init(

		cli_handle_t  *handle,

		int           sleep_flag)

This routine is called by the ICS high-level code to initialize the portion of a client handle that is specific to the ICS low-level code.

handle specifies the client handle to be initialized.

If the sleep_flag parameter is TRUE, then the routine is allowed to sleep while waiting for resources/memory to become available (i.e. the routine must have been called from thread context). In this case the routine icscli_llhandle_init() must return with a 0 (success) return value. If the sleep_flag parameter is FALSE, then the routine is not allowed to sleep waiting for resources/memory to become available (i.e. the routine may have been called from interrupt context). If the resources are not immediately available, icscli_llhandle_init() must return with a EAGAIN (failure) return value; if the resources are immediately available, then the return value is 0 (success).

This routine cannot assume that the low-level ICS portion of the handle has been initialized to zeroes.

If the client handle is being re-used, then icscli_llhandle_deinit() will be called prior to calling this routine.

4.7.2. `icscli_llwouldthrottle()`

int

	icscli_llwouldthrottle(

		node_t   node,

		int      chan)

This routine is called by the high-level ICS routine icscli_wouldthrottle() to determine whether communication over the communication channel chan for node node is currently throttled.

If the specified communication channel is throttled, then the routine returns TRUE, otherwise it returns FALSE.

A communication channel is throttled if, on the server node node, the low-level ICS code performed a call to icssvr_find_recv_handle() for channel chan, and icssvr_find_recv_handle() did not return a handle (the low-level ICS code is allowed some hysteresis in having the ICS client low-level code detect that throttling is occurring on the communication channel to the node).

This routine may be called from interrupt context.

4.7.3. `icscli_llwaitfornothrottle()`

int

	icscli_llwaitfornothrottle(

		node_t   node,

		int      chan)

This routine is called by the high-level ICS routine icscli_waitfornothrottle() to wait until throttling over the communication channel chan for node node is not occurring.

If the communications channel is not throttled at the time the call is made, the routine returns immediately with a return value of FALSE. If the communications channel is throttled, then the routine waits for the throttled condition to subside, and then returns with a value of TRUE.

This routine must be called from thread context (since it can sleep).

4.7.4. `icscli_llsend()`

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- There is no encoding of responses for output arguments. This was

#used to support DMA-push/DMA-pull over the Servernet interconnect.

Any icscli_encoderesp_*() routines are NO-OPs.

- cli_hand[] => clihand_list[]

***********************************************************

int

	icscli_llsend(

		cli_handle_t   *handle)

This routine is called to send a message to a server on another node.

At the time this routine is called, all arguments for the message (i.e. input arguments) must have been encoded using the icscli_llencode_*() routines, and any out-of-line output arguments for the response must have been encoded using the icscli_llencoderesp_ool_*() routines.

This routine can assume that the high-level ICS code has set up the high-level ICS client handle fields ch_procnum, ch_flag, ch_service, ch_transid, ch_chan, ch_status, and ch_node. Additionally, if this message is part of an RPC, the handle will be placed on the cli_hand[] linked list by the ICS high-level code.

For the fields ch_procnum, ch_service, and ch_transid, the low-level ICS code is expected to set up the corresponding fields in the high-level portion of the server handle on the server node.

The field ch_flag is a bit-wise mask where one value is of relevance to the ICS low-level code - CLI_HANDLE_NO_REPLY, which, if set, indicates that this is a message which is not expecting a reply from the server node - the call to icscli_sendup_reply() should take place as soon as the encoded data is no longer going to be referenced by the ICS low-level code.

The field ch_node specifies the destination node where the message is to be sent.

The field ch_chan specifies which communications channel is to be used to send the messsage to the specified node.

The field ch_status is typically zero on entry to this routine. A non-zero value in ch_status when this routine is called indicates that no message should be sent to the destination node - instead, the low-level ICS code should call icscli_sendup_reply() as soon as it has performed any relevant clean-up of any encoded arguments.

The cli_hand[] variable represents an array of clihand_list_t structures which represent linked lists of handles that have outstanding RPCs. References to the list must be locked with the SPIN_LOCK_T field chlist_lock. The first element of the list is the chlist_first field, and the handles are linked using the high-level ICS handle field ch_next. The array which any handle will be on is calculated using CLIHAND_HASH(handle->ch_node). These linked lists can be examined by the ICS low-level code during node-down handling so that any outstanding RPCs when a node goes down can be interrupted. Note that the ICS low-level code can only traverse the linked lists, and not manipulate them.

When the message has been sent (either successfully or unsuccessfully) and any expected reply has also been received, the low-level ICS code is responsible for calling the high-level ICS routine icscli_sendup_reply().

This routine may be called from interrupt context.

4.7.5. `icscli_llhandle_deinit()`

	void

	icscli_llhandle_deinit(

		cli_handle_t   *handle)

This routine is used to deinitialize the low-level ICS portion of a client handle so that either the memory used for the handle can be freed or so that the handle can be re-used.

Note that there is no guarantee that a message was actually sent over the handle, though calls to icscli_llencode*() routines may have been made.

This routine may be called from interrupt context.

4.8. Low-level ICS Client-side Argument Marshalling

The low-level ICS client code has data marshalling routines that correspond exactly to the high-level ICS data marshalling routines. The name of the low-level ICS routine can be derived from the name of the corresponding high-level ICS routine by adding ll after the icscli_ in the high-level ICS routine name. For example, for the high-level ICS routine icscli_encode_int32(), the corresponding low-level ICS routine is called icscli_llencode_int32().

The interface to each low-level ICS routines is identical to the corresponding ICS high-level routine. Hence, the interfaces are not described in this section again.

4.9. Low-level ICS Server-side Upcall Routines

The routines of the high-level ICS server code that are called exclusively by the low-level ICS server code (so-called upcall routines) are specified in this section.

4.9.1. `icssvr_find_recv_handle()`

	svr_handle_t *

	icssvr_find_recv_handle(

		int	chan,

		int	svc)

This routine is called by the low-level ICS server code to obtain a handle to use to hold a message from a client. The handle that is returned will have been initialized via a successful call to icssvr_init_for_recv().

chan is the communication channel that was used to send the message to the server.

svc is the CI/SSI service which was specified on the client side (i.e., the ch_chan field of the client handle when the call to icscli_llsend() was performed).

This routine is permitted to not return a handle in this call (due to lack of server resources) (i.e. this routine is permitted to return NULL). In this case, the high-level ICS server code will later call icssvr_llhandles_present() once resources for the ICS communication channel chan are available. When icssvr_llhandles_present() is called, the low-level ICS code may take the opportunity to call icssvr_find_recv_handle() again.

If icssvr_find_recv_handle() returns NULL, the low-level ICS code is not permitted to call icssvr_find_recv_handle() again for this channel until the ICS high-level code has called icssvr_llhandles_present() for this communication channel.

A NULL return from icssvr_find_recv_handle() should be taken as an indication by the ICS low-level code to institute throttling on the specified communication channel. Similarly, the fact that the ICS high-level code has called icssvr_llhandles_present() should be taken as an indication that throttling on the specified communication channel should be turned off.

This routine may be called from interrupt context.

4.9.2. `icssvr_sendup_msg()`

	void

	icssvr_sendup_msg(

		svr_handle_t	*handle)

This routine is called by the low-level ICS server code once a handle that was obtained using a call to icssvr_find_recv_handle() contains a message and the message can be decoded via calls to the icssvr_decode*() routines.

Prior to calling this routine the low-level ICS code will have initialized the following fields in the high-level portion of the server handle: sh_transid, sh_node, sh_procnum, sh_flag, sh_service, and sh_chan.

The fields sh_transid, sh_procnum, and sh_service are passed through directly from the high-level ICS client code, and are not manipulated by the low-level ICS code.

The sh_node field is the node number of the node from which the message originated. The sh_chan field is the communication channel that was used to send the message (this is determined by the high-level ICS code on the client node).

The field sh_flag is a bit-wise mask where one value is of relevance to the ICS low-level code - SVR_HANDLE_NO_REPLY, which, if set, indicates that this is a message which is not expecting a reply.

This routine may be called from interrupt context.

4.9.3. `icssvr_sendup_replydone()`

	void

	icssvr_sendup_replydone(

		svr_handle_t	*handle)

This routine is called by the low-level ICS server code some time after icssvr_llreply() was called by the high-level ICS server code. This routine is called once the low-level ICS code no longer requires the handle for any purpose (if the handle was used for an RPC, this means that the reply has been sent).

The icssvr_sendup_replydone() routine will ensure that the callback routine specified for icssvr_reply() will be performed.

This routine may be called from interrupt context.

4.10. Low-level ICS Server-side Routines

The routines of the ICS low-level server interface (excluding data-marshalling routines) are specified in this section. These routines are called exclusively by high-level ICS routines.

4.10.1. `icssvr_llhandle_init()`

	void

	icssvr_llhandle_init(

		svr_handle_t    *handle,

		boolean_t       sleep_flag)

This routine is called by the high-level ICS code to initialize the portion of a server handle specific to the ICS low-level code. Server handles are constantly re-used to receive new messages, and icssvr_llhandle_init() is called only once for each handle. To prepare the handle to actually receive the message, the ICS high level code calls icssvr_llhandle_init_for_recv() (including before the first message is received).

This routine cannot assume that the low-level portion of the server handle has been initialized to all zeroes at the time this routine is called.

If sleep_flag has a value of TRUE, then this routine is allowed to sleep to wait for resources/memory to become available. In this case, the routine always returns zero for success.

If sleep_flag has a value of FALSE, then this routine is not allowed to sleep for any reason. If resources/memory that are required are not available without sleeping, then the routine returns EAGAIN, and the ICS high-level code will call the routine at a later time for the same handle to attempt again to allocate the resources. If the resources/memory that are required are immediately available, then the routine returns zero for success.

This routine may be called from interrupt context (if sleep_flag is FALSE).

4.10.2. `icssvr_llhandle_init_for_recv()`

***********************************************************
Errata 10/15/01 for Linux CI and SSI Clusters

- This routine only initializes the low-level portion of the server

handle.

- Throttling/Flow control not supported. The code is

#ifdef'ed ICS_THROTTLE.

***********************************************************

int

	icssvr_llhandle_init_for_recv(

		svr_handle_t    *handle,

		boolean_t       sleep_flag)

This routine is called by the ICS high-level code to prepare a server handle to receive a message from a client.

If sleep_flag has a value of TRUE, then this routine is allowed to sleep to wait for resources/memory to become available. In this case, the routine always returns zero for success.

If sleep_flag has a value of FALSE, then this routine is not allowed to sleep for any reason. If resources/memory that are required are not available without sleeping, then the routine returns EAGAIN, and the ICS high-level code will call the routine at a later time for the same handle to attempt again to allocate the resources. If the resources/memory that are required are immediately available, then the routine returns zero for success.

This routine only initializes the low-level portion of the client handle and does not let the low-level ICS code use the handle to receive messages - when a message arrives, the low-level ICS code must call the routine icssvr_find_recv_handle() to obtain a handle for an arriving message, and then call icssvr_sendup_msg() when data in the received message is ready to be decoded.

This routine may be called from interrupt context (if sleep_flag is FALSE).

4.10.3. `icssvr_llhandles_present()`

	void

	icssvr_llhandles_present(

		int	chan)

The low-level ICS server code calls icssvr_find_recv_handle() to obtain a handle for an arriving message. The message arrived on communications channel chan. The high-level ICS server code is permitted to not return a handle in this call (due to lack of server resources). In this case, the high-level ICS server code will later call icssvr_llhandles_present() once resources for the communication channel are available. When icssvr_llhandles_present() is called, the low-level ICS code may take the opportunity to call icssvr_find_recv_handle() again.

The high-level ICS code will call icssvr_llhandles_present() for a specific communications channel precisely once after a NULL return from icssvr_find_recv_handle() for the same communications channel.

icssvr_llhandles_present() may be called from interrupt context.

4.10.4. `icssvr_lldecode_done()`

	void

	icssvr_lldecode_done(

		svr_handle_t	*handle)

This routine may be called on the server after all calls to the icssvr_lldecode_*() argument decoding routines are done. This routine is typically called from icssvr_decode_done().

Calling this routine is entirely optional, but calling it does give low-level ICS a chance to free resources associated with the incoming message.

This routine may be called from interrupt context.

4.10.5. `icssvr_llreply()`

	void

	icssvr_llreply(

		svr_handle_t	*handle)

This routine is called by the high-level ICS server routine icssvr_reply(). This routine is called if the client is expecting a reply (i.e. the client is performing an RPC) and if the client is not expecting a reply (i.e., the client is sending a message).

If the client is expecting a reply, then the fact that this routine is called means that all arguments in the incoming message have been decoded (using the icssvr_lldecode_*() routines), and that all arguments for the outgoing reply have been encoded (using the icssvr_llencode_*() routines). The response can now be sent back to the client. The low-level ICS server code will call icssvr_sendup_replydone() once the low-level ICS code does not need the handle any more.

If the client is not expecting a reply, then the fact that this routine is called means that all arguments in the incoming message have been decoded (using the icssvr_lldecode_*() routines) and the high-level ICS server code no longer requires the handle. The low-level ICS server code will call icssvr_sendup_replydone() once the low-level ICS code does not need the handle any more.

If the client is expecting a reply, then the high-level ICS code will have set up the server handle field sh_chan to indicate the communication channel that the low-level ICS code should use to send the reply. Additionally, if the client is expecting a reply, the fields sh_transid, sh_node, sh_procnum, sh_flag, and sh_service will remain as they were when icssvr_sendup_msg() was called.

This routine may be called from interrupt context.

4.10.6. `icssvr_llhandle_deinit()`

	void

	icssvr_llhandle_deinit(

		svr_handle_t	*handle)

This routine is called by the ICS high-level code to deinitialize the low-level ICS portion of a server handle. No subsequent calls to icssvr_ll*() routines may be called with this handle, unless it is re-initialized using icssvr_llhandle_init().

This routine is typically called prior to freeing a handle from within icssvr_handle_free(). This routine is not called every time a handle is to be re-used for another message from a client.

This routine may be called from interrupt context.

4.11. Low-level ICS Server-side Argument Marshalling Routines

The low-level ICS server code has data marshalling routines that correspond exactly to the high-level ICS data marshalling routines. The name of the low-level ICS routine can be derived from the name of the corresponding high-level ICS routine by adding ll after the icssvr_ in the high-level ICS routine name. For example, for the high-level ICS routine icssvr_encode_int32(), the corresponding low-level ICS routine is called icssvr_llencode_int32().

The interface to each low-level ICS routine is identical to the corresponding ICS high-level routine. Hence, the interfaces are not described in this section again.

svc_name	is an identification for the service. It is used in constructing the names of data structures defining the service. Typically the prefix is the same or similar to the `.svc` filename - for example, the service defined by file `fbtok.svc` has service name `fbtok` (for remote file block token operations).
svc_id	is the unique number (or symbol representing this) of the ICS service.
hmin	is the minimum number of free ICS server handles held in reserve for this service.
hmax	is the maximum number of free ICS server handles held in reserve for this service.
mem_lwm	is the number of free memory pages below which server-side flow-control (also called throttling) is activated. If not specified, this value is assigned a default at run-time.
mem_hwm	is the number of free memory pages above which server-side flow-control ceases if active. If not specified, this value is assigned a default at run-time.

op_name	is the name of the operation, for example: `pvpop_fork`;
op_attr	is one of the following, defaulting to RPC:
	RPC	the operation is a remote procedure call for which at least a returned status value will be awaited.
	MSG	the operation is a simple message with no response expected
	ASYC	the operation is an RPC but the caller requests that send and receive sub-operations will be separately callable allowing non-blocking behavior.
	SKIP	the operation is a dummy placeholder; it enables operation entry numbers to be reserved at a lower-level with no user-level interface corresponding to them. Note that the op_name field must be given.
	In addition, there are the following modifiers (which can be concatenated to an operation attribute using the syntax op_attr`:`op_modifier):
	NO_SIG_FORWARD	requests that an RPC is not to be interrupted to forward signals on the server.
	NO_BLOCK	declares that a MSG operation may be called from interrupt level and that the client stub should not sleep awaiting system resources but should return with error (EAGAIN).
param_direction	is one of the following:
	IN	for an input paramter;
	OUT	for an output parameter; or
	INOUT	for a parameter that is both input and output.
param_attr	is a composite declaring the data attributes of a parameter, where this is a concatenation of the following separated conventionally by ":" (although any character is accepted):
	OOL	declares the parameter to be out-of-line data.
	XDR	declares the parameter is of complex datatype which has been defined as an XDR-able structure and is to be encoded/decoded using an xdr-routine (either generated by `rpcgen` or hand-written).
	VAR	declares the parameter to be of variable length with the length determined by the following implicitly-generated parameter. That is, the caller of this operation supplies an additional argument specifying the length of the variable parameter. This length is generally the number of basic elements in the supplied array: this is the number of bytes for a character array but the number of elements in an array of more complex type. See below for an example of this.
	PTR	declares that an input parameter being passed by reference is a a non-scalar type, which is not equivalent to intgeger type, and this pointer may be NULL. Parameters referenced by pointer must always be non-NULL - unless they have the PTR, VAR or OOL attribute.
	FREE	declares that an input parameter requires an addition call to free resources associated with it on the server-side after the remote server operation has been completed.
param_utype	is the user type of this parameter, for example: `struct vproc` or `cred_t`.
param_name	is the name of the parameter. Note that the name may be prefixed by "``"s indicating indirection; these are implicitly suffixed to param_utype*.
In addition to these declared parameters, extra parameters are implicit for operations. If these were to be explicitly declared, these would occur as the first parameters to each operation and would take the following forms. For non-`MSG` operations returning a value:
	`param IN node_t node` `param OUT int *rval`
and for `MSG` operations not returning any value:
	`param IN node_t node`
where:
`node`	is the node number of the destination node on which the operation is targeted; and
`rval`	is the return value of the non-message operation.
Variable length parameters declared with attribute `VAR` have an associated length parameter which is implicitly declared as the next parameter. Thus, a declaration of the form:
	`param IN:VAR char name[MAX_NAME_LEN]`
will be handled by ICS as if the following were declared:
	`param IN:VAR char name[MAX_NAME_LEN]`
	`param IN int name_len`

where:
OP_NAME	is the name of the operation (as it appears in the operation definition, without prefix and in upper-case);
`node`	is the node number identifying a remote node on which the operation is to be performed;

`rval`	is a variable to be assigned the return value of the remote operation;
`argument_list`	is the argument list (excluding destination node) for the operation.

where:
`handle`	is a message handle returned by `_SEND` and supplied to `_RECEIVE`.

	1.	Test primitive types and basic RPC support. a. Scalar integer values. `IN`, `OUT`, and `INOUT` integer values are exchanged. This verifies the basic operation of the ICS subsystem, the generated ICS client and server stubs and the interface with the ICS subsystem: in particular, routines `icscli_handle_get()`, `icscli_encode_int32()`, `icscli_decode_int32()`, `icssvr_decode_int32()`, `icssvr_encode_int32()`, `icscli_send()`, and `icscli_handle_release()`. b. Variable-length integer arrays. `IN`, `OUT`, and `INOUTVAR` integer arrays are exchanged. This verifies the generation of ICS stubs for these types and of ICS interface routines: `icscli_encode_int32_array_t()`, `icscli_decode_int32_array_t()`, `icssvr_decode_int32_array_t()`, and `icssvr_encode_int32_array_t()`. c. Fixed and variable length character arrays. `IN`, `OUT`, and `INOUT` fixed-length char arrays, and `IN` and `INOUT VAR` arrays are exchanged. This verifies the generation of ICS stubs for these types and of ICS interface routines: `icscli_encode_char_array_t()`, `icscli_decode_char_array_t()`, `icssvr_decode_char_array_t()`,`icssvr_encode_char_array_t()`. Also, the use of ICS marshaling macros is verified for the handling of fixed-length data stuctures (in this case a simple character array).
	2.	Out-of-line data. `IN`, `OUT`, and `INOUT` fixed-length char arrays, and `IN` and `INOUTVAR` arrays are exchanged as out-of-line data. This verifies the generation of ICS stubs for these types and of ICS interface routines: `icscli_encode_ool_data_t()`, `icscli_encoderesp_ool_data_t()`, `icscli_decode_ool_data_t()`, `icssvr_decode_ool_data_t()`, and `icssvr_encode_ool_data_t()`. In addition, ICS marshaling support routines to allocate and free memory are exercised.
	3.	Special data marshaling. This series of tests verifies special-purpose datatype handling provided at the ICS stub level. a. `PTR` and `FREE` parameter attributes. `PTR` parameters are potentially `NULL` pointers to `IN` data structures. The `FREE` attribute requests that the server stub call a free routine for a datatype after the remote procedure returns. This test verifies that non-`NULL` and `NULL` pointers are correctly marshalled and that the associated free routine is called. b. `uio` and `iovec` marshaling. This test verifies that a combined uio/iovec structure pair is correctly marshaled to and from a remote node. Note that such data structures are always `INOUT`. c. XDR and credentials type marshaling. This test exercises several capabilities: � the use of XDR marshaling routines to transfer data structure in the form of out-of-line encoded mblks; � ICS credentials marshaling macros for the exchange of credentials information (including `NULL` and system credentials) as ICS in-line data; and � the equivalence of credentials marshaled in either of these ways. This tests generated stub code and the support routines for interfacing with standard XDR marshaling routines and the production and consumption of data-encoded mblks.
	4.	Asynchronous capabilities (`ASYNC` operation attribute). In this test, an `ASYNC` operation is first preformed synchronously as a conventional RPC and then `_SEND` and `_RECEIVE` portions are performed separately with a one second delay between. This is a test of ICS stub generation only; the ICS subsystem does not distinguish between synchronous and split asynchronous operation.
	5.	Messaging (`MSG` operation attribute). This tests the ability of ICS to perform a remote operation without returning data and with or without the client awaiting completion. There are 2 sub-tests. In both tests data is returned to the invoking node by means of a double-hop message: test node to second node and second node back to test node. In the first test, both hops are performed synchronously - the caller blocks until the remote operation has completed. In the second test, each hop is fully asynchronous with the testing node waiting one second to allow the return message to be delivered.
	6.	Sundry tests. Finally, a series of unusual parameter configurations is tested to verify ICS stub generation logic: a. no input and no output parameters; b. no output parameter; and c. no input parameter.
	7.	Operation placeholder (`SKIP` operation attribute). A `SKIP`ped operation is included in the `.svc` file to verify that no stubs are generated for it but that the operation is correctly allocated an unused entry in the generated server table.

	1.		Obtain a client handle using `icscli_handle_get()`.
	2.		Set up all the necessary input arguments using the `icscli_encode_*()` routines.
	3.		Set up all the necessary information for out-of-line output arguments using the `icscli_encoderesp_ool_*()` routines.
	4.		Call `icscli_send()` to send the message to the server. `icscli_wait_callback()` is typically specified as the callback routine.
	5. 6.		Wait for the reply to come back, typically using `icscli_wait()`. Obtain all the output arguments from the response message using the `icscli_decode_*()` routines.
	7.		Release the client handle using `icscli_handle_release()`.

	1.		Obtain the appropriate client handles using `icscli_handle_get()` (one handle for each node).
	2.		Set up all the necessary input arguments using the `icscli_encode_*()` routines (the argument encoding must take place for each handle).
	3.		Set up all the necessary information for out-of-line output arguments using the `icscli_encoderesp_ool_*()` routines (again, this must take place for each handle).
	4.		Call `icscli_send()` for each handle to send the messages to all the servers. The callback routine specified would typically be `icscli_wait_callback()`.
	5. 6.		Perform any work that can be performed locally. Wait for all the replies to arrive back on this node by calling `icscli_wait()` for each handle.

	7.	Obtain all the output arguments using the `icscli_decode_*()` routines (the argument decoding must take place for each handle).
	8.	Release the client handles by calling `icscli_handle_release()` for each handle.

	1.	When the client handle is obtained using `icscli_handle_get()`, the code in `icscli_handle_get()` will call either `icscli_llwouldthrottle()` or `icscli_llwaitfornothrottle()` to handle any communication throttling that may be occurring. Then the routine `icscli_llhandle_init()` is called to initialize the handle.
	2.	Each call for encoding input arguments using the `icscli_encode_()` routines will result in a call to the corresponding `icscli_llencode_()` routine.
	3.	Each call for encoding responses for output arguments using the `icscli_encoderesp_()` routines will result in a call to the corresponding `icscli_llencoderesp_()` routine.
	4.	When the client calls `icscli_send()`, the routine `icscli_llsend()` will be called after the specified handle set-up has been performed. The low-level ICS code calls `icscli_sendup_reply()` when appropriate (when a reply arrives back for RPCs, when the message has been sent for messages). The routine `icscli_sendup_reply()` is responsible for performing the callback routine for `icscli_send()`. If this is an RPC, the low-level ICS code will call `icscli_find_transid_handle()` to determine which handle a reply corresponds to.

	5.	If the original call was an RPC, then there are arguments to decode. Each call for decoding arguments using the `icscli_decode_()` routines will result in a call to the corresponding `icscli_lldecode_()` routine.
	6.	When the client handle is released via a call to `icscli_handle_release()`, the routine `icscli_llhandle_deinit()` will be called.

	1.	The handle allocation routine `icssvr_handle_get()` calls `icssvr_handle_init()`.
	2.	The high-level ICS code calls `icssvr_llhandle_init_for_recv()` (typically from `icssvr_recv()`). If `icssvr_llhandle_init()` returns `NULL`, then `icssvr_llhandle_init()` will be called again later, from thread context (where it cannot return `NULL`).
	3.	The next step varies, depending upon whether the low-level ICS code needs a handle to receive a message or not. If the low-level ICS code does need a handle to receive a message, it performs a call to `icssvr_handle_recv()` to obtain such a handle. If the low-level ICS code does not need a handle to receive a message, it performs a call to `icssvr_handle_recv()` after a message arrives. In either case, there will be an incoming message corresponding to a handle (note that this scenario does not cover the case where `icssvr_handle_recv()` returns `NULL` - see the interface descriptions for `icssvr_handle_recv()` and `icssvr_llhandles_present()` for details on this).
	4.	The low-level ICS code calls `icssvr_sendup_msg()` when a message arrives on the node and the input arguments are ready to be decoded.
	5.	Each call for decoding arguments using the `icssvr_decode_()` routines will result in a call to the corresponding `icssvr_lldecode_()` routine.

	6.	The routine `icssvr_lldecode_done()` will typically be called from `icssvr_decode_done()`.
	7.	If the original call was an RPC, then there are output arguments to encode. Each call for encoding output arguments using the `icssvr_encode_()` routines will result in a call to the corresponding `icssvr_llencode_()` routine.
	8.	The routine `icssvr_llreply()` will be called from `icssvr_reply()`. When the low-level ICS code is done with the handle, it will call `icssvr_sendup_replydone()`. `icssvr_sendup_replydone()` is responsible for ensuring that the callback routine for `icssvr_reply()` is called.
	9.	If the high-level ICS code determines that the handle is to be re-used, go back to the second step. If the high-level ICS code determines that the handle is no longer required, then `icssvr_handle_release()` will call `icssvr_llhandle_deinit().`

	a.	the node to which the message/RPC was sent is down (or the node doesn't exist); in this case the `ch_status` field of `handle` is `EREMOTE`; or
	b.	the call to `icscli_llsend()` was made with a non-zero value in the `ch_status` field of the handle; in this case the `ch_status` field of `handle` will still contain this non-zero value; or
	c.	the caller of `icscli_llsend()` specified an RPC and the reply to the RPC has arrived and any out-of-line data that may have been part of the reply is is ready to be decoded using the `icscli_decode_*()` routines and the out-of-line data that was part of the original message will no longer be referenced by the low-level ICS code (and `handle` can be deinitialized); in this case the `ch_status` field of `handle` will have a value of zero; or
	d.	the caller of `icscli_llsend()` specified a message (not an RPC) and the out-of-line data that was part of the original message will no longer be referenced by the low-level ICS code (and `handle` can be deinitialized); in this case the the `ch_status` field of `handle` will have a value of zero.

	#used to support DMA-push/DMA-pull over the Servernet interconnect.
	Any icscli_encoderesp_*() routines are NO-OPs.

OPEN CI and OPEN SSI CLUSTER PROJECTS

1. ICS Messages and Responses

2. Generating ICS Code Using icsgen

2.1. Services

2.2. Operations

2.3. Datatypes and Encoding

2.4. Service Invocation Macros

2.5. Summary Format of Generated Client-side Stubs

2.6. Summary Format of Generated Server-side Stubs

2.7. Generated Prototypes

2.8. Generated Dispatch Table and Service Registration Routine

2.9. Interface Examples

2.10. icsgen

2.10.1. Syntax

2.10.2. Description

2.10.3. Options

2.10.4. Files

2.10.5. Examples

2.11. Source and Build Environment

2.12. Defining Encoding and Decoding

2.13. XDR-defined Datatypes

2.14. nsc_rcall() Emulation

2.15. Verification Tests

3. High-level ICS

3.1. CI/SSI Services and ICS Communication Channels

3.2. ICS Server Control Code

3.3. Typical ICS Client Usage Scenarios

3.4. Typical ICS Server Usage Scenarios

3.5. ICS General Purpose Routines

3.5.1. ics_svc_register()

3.5.2. ics_init()

3.5.3. ics_nodeinfo_callback()

3.5.4. ics_geticsinfo()

3.5.5. ics_seticsinfo()

3.5.6. ics_nodeup()

3.5.7. ics_nodedown()

3.5.8. ics_getpriority()

3.5.9. ics_setpriority()

3.6. ICS Client-side Routines

3.6.1. icscli_handle_get()

3.6.2. icscli_wouldthrottle()

3.6.3. icscli_waitfornothrottle()

3.6.4. icscli_send()

3.6.5. icscli_wait_callback()

3.6.6. icscli_wait()

3.6.7. icscli_get_status()

3.6.8. icscli_handle_release()

3.6.9. icscli_handle_release_callback()

3.7. ICS Client-side Argument Marshalling

3.7.1. icscli_??code_type()

3.7.2. icscli_??code*_data_t()

3.7.3. icscli_??code*_uio_t()

3.7.4. icscli_??code_mbuf_t()

3.7.5. icscli_??code_mblk_t()

3.8. ICS Server-side Routines

3.8.1. icssvr_handle_get()

3.8.2. icssvr_recv()

3.8.3. icssvr_decode_done()

3.8.4. icssvr_reply()

3.8.5. icssvr_handle_release()

3.8.6. icssvr_nodedown_svc_wait()

3.9. ICS Server-side Argument Marshalling

3.9.1. icssvr_??code_type()

3.9.2. icssvr_??code_data_t()

3.9.3. icssvr_??code_uio_t()

3.9.4. icssvr_??code_mbuf_t()

3.9.5. icssvr_??code_mblk_t()

4. Low-level ICS

4.1. Low-level ICS and Communication Channels

4.2. Low-level ICS Typical Client Scenarios

4.3. Low-level ICS Typical Server Scenarios

4.4. Low-level ICS General Purpose Upcall Routines

4.4.1. ics_nodedown_notification()

4.4.2. ics_nodeup_notification()

4.5. Low-level ICS General Purpose Routines

4.5.1. ics_llinit()

4.5.2. ics_llgeticsinfo()

4.5.3. ics_llseticsinfo()

4.5.4. ics_llnodeup()

4.5.5. ics_llnodedown()

2.10. `icsgen`

2.14. `nsc_rcall()` Emulation

3.5.1. `ics_svc_register()`

3.5.2. `ics_init()`

3.5.3. `ics_nodeinfo_callback()`

3.5.4. `ics_geticsinfo()`

3.5.5. `ics_seticsinfo()`

3.5.6. `ics_nodeup()`

3.5.7. `ics_nodedown()`

3.5.8. `ics_getpriority()`

3.5.9. `ics_setpriority()`

3.6.1. `icscli_handle_get()`

3.6.2. `icscli_wouldthrottle()`

3.6.3. `icscli_waitfornothrottle()`

3.6.4. `icscli_send()`

3.6.5. `icscli_wait_callback()`

3.6.6. `icscli_wait()`

3.6.7. `icscli_get_status()`

3.6.8. `icscli_handle_release()`

3.6.9. `icscli_handle_release_callback()`

3.7.1. `icscli_??code_type()`

3.7.2. `icscli_??code*_data_t()`

3.7.3. `icscli_??code*_uio_t()`

3.7.4. `icscli_??code_mbuf_t()`

3.7.5. `icscli_??code_mblk_t()`

3.8.1. `icssvr_handle_get()`

3.8.2. `icssvr_recv()`

3.8.3. `icssvr_decode_done()`

3.8.4. `icssvr_reply()`

3.8.5. `icssvr_handle_release()`

3.8.6. `icssvr_nodedown_svc_wait()`

3.9.1. `icssvr_??code_type()`

3.9.2. `icssvr_??code_data_t()`

3.9.3. `icssvr_??code_uio_t()`

3.9.4. `icssvr_??code_mbuf_t()`

3.9.5. `icssvr_??code_mblk_t()`

4.4.1. `ics_nodedown_notification()`

4.4.2. `ics_nodeup_notification()`

4.5.1. `ics_llinit()`

4.5.2. `ics_llgeticsinfo()`

4.5.3. `ics_llseticsinfo()`

4.5.4. `ics_llnodeup()`

4.5.5. `ics_llnodedown()`

4.6.1. `icscli_find_transid_handle()`

4.6.2. `icscli_sendup_reply()`

4.7.1. `icscli_llhandle_init()`

4.7.2. `icscli_llwouldthrottle()`

4.7.3. `icscli_llwaitfornothrottle()`

4.7.4. `icscli_llsend()`

4.7.5. `icscli_llhandle_deinit()`

4.9.1. `icssvr_find_recv_handle()`

4.9.2. `icssvr_sendup_msg()`

4.9.3. `icssvr_sendup_replydone()`

4.10.1. `icssvr_llhandle_init()`

4.10.2. `icssvr_llhandle_init_for_recv()`

4.10.3. `icssvr_llhandles_present()`

4.10.4. `icssvr_lldecode_done()`

4.10.5. `icssvr_llreply()`

4.10.6. `icssvr_llhandle_deinit()`