
spawndaemon(1M)


spawndaemon -- user-level interface to the keepalive daemon

Synopsis

spawndaemon [ [-i cluster_wide] | [-i node_list] ]
      [-n times] [ [-Z last_node] | [-Z round_robin] ] 
      [-o] [-a] 
      [-c down=exit_code] [-c reject=exit_code] 
      -r process_cfg_file

spawndaemon [ [-i cluster_wide] | [-i node_list] ] 
      [ [-Z last_node] | [-Z round_robin] ] 
      [-o] [-a]
      -R process_cfg_file pid 

spawndaemon [ [-i cluster_wide] | [-i node_list] ] 
      [ [-Z last_node] | [-Z round_robin] ]
      [-o] -r group_cfg_file

spawndaemon [ [-i cluster_wide] | [-i node_list] ]  
      [-n times] 
      -F node [ [-B node] . . . ] 
      [ [-Z F_node] | [-Z last_node] | [-Z round_robin] ] 
      [-o] [-a] 
      [-c down=exit_code] [-c reject=exit_code] 
      -r process_cfg_file 

spawndaemon [ [-i cluster_wide] | [-i node_list] ]  
      -F node [ [-B node] . . . ] 
      [ [-Z F_node] | [-Z last_node] | [-Z round_robin] ] 
      [-o] [-a] 
      -R process_cfg_file pid 

spawndaemon [ [-i cluster_wide] | [-i node_list] ]
      -F node [ [-B node] . . . ] 
      [ [-Z F_node] | [-Z last_node] | [-Z round_robin] ] 
      [-o] -r group_cfg_file

spawndaemon [-i cluster_wide] [-n times] 
      -U [-o] [-a] 
      [-c down=exit_code]
      -r process_cfg_file 

spawndaemon [-i cluster_wide]
      -U [-o] [-a] 
      -R process_cfg_file pid 

spawndaemon [-i cluster_wide] 
      -U [-o]
      -r group_cfg_file

spawndaemon [-k] -x -D full_pathname 

spawndaemon [-k] -x -P pid

spawndaemon [-k] -x -S slot

spawndaemon [-k] [-a] -d full_pathname [arg_list]

spawndaemon [-k] -p pid

spawndaemon [-k] -g group_name

spawndaemon [-k] -s slot

spawndaemon -q 

spawndaemon [-k] -Q 

spawndaemon -L [-v human [processes | keepalive] ]

spawndaemon -L [-v machine [processes | keepalive] ]

spawndaemon -X

spawndaemon -z max_processes

Description

The spawndaemon command provides the command-line interface to the keepalive(1M) daemon, which monitors processes and daemons, and restarts those processes/daemons when they die. keepalive can monitor processes and daemons individually and in groups.

Specifically, spawndaemon performs the following tasks:

  • Processes the spawndaemon command-line options and the associated configuration files.

  • Sends messages to the keepalive daemon, which then performs all process/daemon startup and monitoring tasks.

  • Reads information from keepalive's monitored process table and displays that information for the system administrator.

To configure a process or daemon to be monitored, perform the steps described in the following paragraphs. For more information about particular files and directories, see the Files section later in this reference manual page.

  1. Create a process configuration file for the process or daemon and store it in the /etc/spawndaemon.d directory. For information about configuration file syntax, owner and permission requirements, see the Configuration Files section later in this reference manual page.

  2. Create a script in one of the /etc/rc*.d directories, and in that script call spawndaemon to register the process/daemon with keepalive. Select the appropriate spawndaemon syntax from the preceding Synopsis section; the spawndaemon command references the configuration file created in the previous step. For information about spawndaemon options, see the Options section later in this reference manual page.

  3. Create a startup script in the /etc/keepalive.d directory. The keepalive daemon calls this startup script to start the process or daemon and (if so configured) to restart the process/daemon if it fails.

  4. Create other (optional) scripts to be used by keepalive for handling your process/daemon. Examples of optional scripts include: restart on process/daemon failure, restart on node failure, clean up when the number of process/daemon failures exceeds the configured limit, and shut down when directed manually by the spawndaemon command. These scripts must also reside in the /etc/keepalive.d directory.

  5. Start the process/daemon, either by calling spawndaemon from the command line of your system, or by restarting your system and allowing the various scripts in the /etc/rc*.d directories and the /etc/keepalive.d directory to start both keepalive and your processes/daemons.

  6. Verify that your process/daemon is registered correctly by using the spawndaemon -L and -v options to read the registration information maintained by keepalive.

To configure a process/daemon group for monitoring, each group member must be registered and must have a process configuration file and startup/recovery scripts, just as an individual process/daemon does. In addition, a group configuration file is required as described under Configuration Files.

Options

The spawndaemon command uses the following options:

-a
When used for process/daemon registration, the -a option specifies that a process/daemon be registered with keepalive by name and argument list. Including an argument list in the registration provides a method to distinguish between processes/daemons having the same name. The argument list is contained in the process configuration file (see arg_list under Configuration Files) designated with the -r or -R options. The -a option is intended to be used with processes that daemonize themselves. The -a option cannot be used with group registrations.

When -a is specified, keepalive searches the system process table using both the process name and arg_list to find and register the process ID (PID) of the process after it has daemonized. When -a is not used, only the process name is used in the search of the process table to find the PID following daemonization.

arg_list does not have to contain all of the arguments used by the process/daemon command line(s) for startup and recovery (restart). However, arg_list must contain one or more of the arguments in the same order (with none missing in the sequence provided) as they appear on the command line(s), starting at the beginning of the argument list.
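The prefix rule described above can be illustrated with a short Python sketch. The function name is hypothetical, and the assumption that arguments are compared as shell-style tokens is ours; how keepalive actually compares argument lists is not specified here.

```python
import shlex


def arg_list_is_valid_prefix(arg_list, command_line_args):
    """Return True if arg_list is a non-empty leading run of the
    arguments that appear on the process/daemon command line."""
    registered = shlex.split(arg_list)
    actual = shlex.split(command_line_args)
    if not registered:
        # At least one argument is required.
        return False
    # Same arguments, same order, none skipped, starting at the beginning.
    return actual[:len(registered)] == registered
```

With a command line of `-f /etc/foo.conf -v`, the arg_list `-f /etc/foo.conf` qualifies, while `-v` (does not start at the beginning) and `-f -v` (skips an argument) do not.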

When the -a option is used to unregister a process/daemon (see -d option) that has been registered with an argument list, the arg_list must be included with the -d option in the spawndaemon command line.

-B node
A single use of the -B option specifies the number of the (backup) node on which the process/daemon is to be executed if the node specified with the -F option is unavailable. For more information on the nodes in your cluster, see cluster(1M). The -B option can only be used if the -F option is also used.

The -B option can be used more than once. If the node specified by the first use of the -B option is unavailable, keepalive uses the node specified by the second use of the -B option, and so on. The -B option can be used up to 12 times. Refer to the -Z option for the available restart policies involving the nodes specified with the -F and -B options.
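The fallback order described above (favored node first, then each backup node in the order given) can be sketched in Python as follows. This is only an illustration of the ordering: the function name is hypothetical, and the node actually chosen on a restart also depends on the -Z policy.

```python
MAX_BACKUP_NODES = 12  # the -B option can be used up to 12 times


def first_available_node(favored, backups, available):
    """Return the first node from [favored, *backups] that is available,
    or None if none of them is."""
    if len(backups) > MAX_BACKUP_NODES:
        raise ValueError("at most 12 backup nodes may be specified")
    for node in [favored, *backups]:
        if node in available:
            return node
    return None
```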

-c down=exit_code
This option tells keepalive to enable the down feature for the process/daemon being registered. The down feature allows a process/daemon to instruct keepalive to take that process/daemon to the keepalive down state. keepalive will not restart any process/daemon in the down state. In order to restart such a process/daemon, the spawndaemon command with the -x option must be used.

When a process/daemon exits to the down state, it must return the exit code you specify in exit_code. exit_code must be an integer other than zero (0). If exit_code is 0 (or if the -c reject option has already been specified with the same exit_code), registration fails and the process/daemon is not started. keepalive communicates the down exit code to the process/daemon through the KEEPALIVE_PROCESS_DOWN environment variable, the value of which is the exit_code you specify in the call to spawndaemon. KEEPALIVE_PROCESS_DOWN is only set if spawndaemon is called with the -c down=exit_code option. See the Configuration Files section for information about the down_script and down_script_policy fields in the process configuration file used to specify and control the execution of the script that keepalive calls when the process/daemon goes to the down state. This process/daemon down feature is not supported for group registrations.
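As a rough illustration of the process/daemon side of the down feature, the following Python sketch reads KEEPALIVE_PROCESS_DOWN and exits with that code. The function names are hypothetical, and the fallback exit status of 1 (used when the variable is not set) is an assumption, not documented keepalive behavior.

```python
import os
import sys


def down_exit_code(environ=os.environ):
    """Return the down exit code passed by keepalive, or None if the
    down feature was not enabled at registration."""
    value = environ.get("KEEPALIVE_PROCESS_DOWN")
    if value is None:
        return None
    code = int(value)
    # A down exit code of 0 is never valid.
    return code if code != 0 else None


def exit_to_down_state(environ=os.environ):
    """Exit with the down exit code so that keepalive does not restart us."""
    code = down_exit_code(environ)
    # Assumption: fall back to a generic failure status when unset.
    sys.exit(code if code is not None else 1)
```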

-c reject=exit_code
This option tells keepalive to enable the node-rejection feature for the process/daemon being registered. The node-rejection feature allows a process/daemon having a resource problem (for example, insufficient memory) to reject the node on which it is running. In rejecting a node, the process/daemon sends exit_code to keepalive, which causes keepalive to fail the process/daemon over to another node. keepalive chooses the new node based on the node selection policy specified by the -Z option.

If a process/daemon rejects all of the nodes in the cluster, keepalive clears the list of rejected nodes except for the most recently rejected node. However, if that node is the only available node, keepalive clears it also. With the rejected node list cleared, keepalive begins anew trying to move the process/daemon to another node. The error counter for the process/daemon is not reset. Each node rejection counts as a failure; therefore, keepalive eventually takes the process/daemon to the down state (when max_errors failures occur within probation_period seconds as specified in the process configuration file).
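The list-clearing rule above can be sketched in Python. This is a reading of the paragraph, not keepalive's implementation; the function name and the list representation are hypothetical.

```python
def record_rejection(rejected, available, node):
    """Return the rejected-node list after `node` has been rejected.

    rejected  -- nodes rejected so far
    available -- nodes currently available in the cluster
    """
    updated = [n for n in rejected if n != node] + [node]
    if all(n in updated for n in available):
        # Every available node has now been rejected: clear the list,
        # keeping only the most recently rejected node ...
        if len(available) == 1:
            # ... unless it is the only available node.
            return []
        return [node]
    return updated
```

For example, with available nodes 1, 2, and 3 and nodes 1 and 2 already rejected, rejecting node 3 leaves only node 3 on the list, so nodes 1 and 2 become eligible again.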

When a process/daemon rejects a node, it returns the exit code that you specify in exit_code. exit_code must be an integer other than zero (0). If exit_code is 0 (or if the -c down option has already been specified with the same exit_code), registration fails and the process/daemon is not started. keepalive communicates the node rejection exit code to the process/daemon through the KEEPALIVE_NODE_REJECT environment variable, the value of which is the exit_code you specify on the call to spawndaemon. KEEPALIVE_NODE_REJECT is only set if spawndaemon is called with the -c reject=exit_code option.
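The constraints on the two exit codes (non-zero, and different from each other) can be checked before spawndaemon is called. The Python helper below is a hypothetical pre-flight check that returns the corresponding -c arguments; it is not part of spawndaemon.

```python
def exit_code_args(down=None, reject=None):
    """Validate the down/reject exit codes and return the matching
    spawndaemon arguments as a list."""
    for name, code in (("down", down), ("reject", reject)):
        if code is not None and code == 0:
            raise ValueError(f"{name} exit code must not be 0")
    if down is not None and down == reject:
        raise ValueError("down and reject exit codes must differ")
    args = []
    if down is not None:
        args += ["-c", f"down={down}"]
    if reject is not None:
        args += ["-c", f"reject={reject}"]
    return args
```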

If a node failure recovery script has been specified in the process configuration file, keepalive runs that script to recover from a node rejection. If no node failure recovery script has been specified, keepalive runs the process failure recovery script (if specified) or the startup script. See the Configuration Files section for information about specifying the script that keepalive calls when the process/daemon rejects a node. The node-rejection feature is not supported for group registrations.

-d full_pathname [arg_list]
Unregisters the named process/daemon. When specifying full_pathname, use a full path name as designated by the full_path_to_executable field in the process configuration file. When -d is used with the -a option, arg_list must be specified. See the -a option for details.

-D full_pathname
Identifies a process/daemon by name to the -x option. When specifying full_pathname, use a full path name as designated by the full_path_to_executable field in the process configuration file.

-F node
Specifies the number of the (favored) node on which the process/daemon is to be executed. keepalive pins the process/daemon on the specified node so that the process/daemon cannot be migrated by using the load_leveld(1) utility; the pinned process/daemon ignores any migration request. cluster(1M) can be used to get information on the nodes in your cluster.

The nodes designated with the -F and -B options form the set of nodes on which the process/daemon is allowed to run. The restart policy option (-Z) determines how nodes are selected from this set when the process/daemon needs to be restarted or fails to (re)start on a selected node.

-g group_name
Unregisters the processes/daemons in the group specified by group_name, where group_name is defined in the group configuration file for the group.

-i cluster_wide | -i node_list
Enables idempotency enforcement on a cluster-wide or node-list basis. The -F option is required and -B is optional when -i node_list is used. Idempotency enforcement is performed at registration time and persists after registration is completed. Idempotency for the registered process/daemon instance continues to be enforced whether or not the -i option is used on subsequent attempts to register additional instances of the same process/daemon. A special spawndaemon exit code of 2 (see Return Values) indicates an idempotency violation.

When cluster_wide is specified, keepalive's monitored process table is searched. In the case of registering a single process/daemon, if the process/daemon is already registered, spawndaemon counts this as an idempotency violation and does not register the new instance. In the case of registering a group, if any member of the group is already registered for that group, then spawndaemon counts this as an idempotency violation and does not register the new group instance. Otherwise, in both cases, registration proceeds as normal.

When node_list is specified, keepalive's monitored process table is searched. In the case of registering a single process/daemon, the node list for the new registered instance must be mutually exclusive with the existing registered instances of the same process/daemon. If the node lists are not mutually exclusive, spawndaemon counts the registration attempt as an idempotency violation and does not register the new instance. In the case of registering a group, the monitored process table is searched to verify that the new group instance has a node list that is mutually exclusive with any other group instances of the same name already registered with keepalive. If a group instance of the same name is already registered with a node list that is not mutually exclusive with the new group instance, spawndaemon counts this as an idempotency violation and does not register the new group instance.
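The node-list check amounts to a test for disjoint sets. A minimal Python sketch follows, assuming each registered instance's node list is its -F node plus any -B nodes; the function name is hypothetical.

```python
def violates_node_list_idempotency(new_nodes, registered_node_lists):
    """Return True if new_nodes shares at least one node with the node
    list of any already-registered instance of the same process/daemon
    (or of the same group)."""
    new = set(new_nodes)
    return any(new & set(existing) for existing in registered_node_lists)
```

For example, a new instance on nodes 1 and 2 conflicts with an existing instance on nodes 2 and 3, but not with one on nodes 3 and 4.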

Group registrations, both cluster wide and node list, are only affected by other registered instances of the same group. When spawndaemon attempts to register a group of processes and discovers that one or more processes within the group have been previously registered as non-grouped processes, it still registers all the processes specified in the group. If you want two groups containing the same set of processes to run at the same time, be sure to specify different names for the two groups.

Group registrations can prevent single process registrations. If a group is registered and the -i cluster_wide option is used to attempt to register a single process/daemon that is a member of the group, the registration attempt is considered an idempotency violation. If a group is registered and the -i node_list option is used to attempt to register a single process/daemon with the same node list as another instance of the process/daemon that belongs to the group, the registration attempt is considered an idempotency violation.

-k
Invokes the shutdown script (named in the process configuration file) for the process/daemon identified by another option in the command line. Options for identifying the process/daemon include -d, -p, -s, -D, -P, and -S. If no shutdown script exists for the process/daemon, a SIGTERM signal is sent to the process/daemon. The termwait option specified in the process configuration file can be used to define a time interval (in seconds) that keepalive gives the process/daemon to shut down before sending it a SIGKILL signal. The time interval defaults to two seconds.
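The SIGTERM-then-SIGKILL sequence used when no shutdown script exists can be sketched in Python for a child process. This illustrates the general pattern only; the function name is hypothetical, and the default of two seconds mirrors the termwait default described above.

```python
import signal
import subprocess


def stop_process(proc, termwait=2.0):
    """Send SIGTERM to a subprocess.Popen child, wait up to termwait
    seconds, then send SIGKILL if it is still running.
    Returns the name of the signal that ended the wait."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=termwait)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX systems
        proc.wait()
        return "SIGKILL"
```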

keepalive attempts to run the shutdown script on the node where the process/daemon was last active. If that node is down, keepalive tries to fork/exec the shutdown script on each of the other nodes in the cluster in turn, until an attempt succeeds or the list of available nodes is exhausted. As each attempt is made, warning messages are posted to the system log. If it is impossible to run the shutdown script, an error message is posted to the system log and system console.

When -k is used with -g to shut down a group, shutdown scripts are used for those group members that have them; SIGTERM is used for those group members that do not have a shutdown script.

-L
Displays the state of the monitored process table in an abbreviated form. If the keepalive daemon has been quiesced (see -q option), a message to that effect follows the table. The -v (verbose) options display/read the full details of the monitored process table.

-n times
Used with the -r option to specify the number of times keepalive registers and starts a process/daemon. If the -i option is also specified, the processes/daemons are started only if no copies of the process/daemon are already running (so you can make sure that all old copies of a process/daemon are gone before starting new ones). The -n option cannot be used with group registrations.

-o
A process registered with the -o option must not daemonize itself. This option instructs keepalive to not perform daemonization recovery, thereby optimizing keepalive's performance for processes of this type. If -o is used for group registration, none of the group members can be daemons.

-p pid
Specifies the process ID of a process/daemon to be unregistered.

-P pid
Identifies a process/daemon by process ID to the -x option.

-q
Quiesces keepalive; that is, prevents keepalive from monitoring processes or daemons. keepalive continues to maintain its internal status. spawndaemon is still functional while keepalive is quiesced; however, monitored processes/daemons are not restarted if they die with keepalive in this quiesced state.

Use the -X option to resume normal keepalive operation.

-Q
Shuts down the keepalive daemon, cleans up, and exits. As part of the cleanup procedure, -Q clears keepalive's monitored process table, which means that all process/daemon registrations are lost. This action does not affect the processes or daemons themselves. They continue to run and are unaffected. Use the -k option with -Q if you also want to shut down all of the monitored processes/daemons.

To shut down keepalive, but leave the monitored process table intact, send keepalive a SIGTERM signal so it performs a controlled exit.

For keepalive to remain down, /etc/inittab must be edited to remove the keepalive entries.

-r process_cfg_file
Registers the process/daemon named in the process configuration file specified in process_cfg_file. After the process/daemon is registered with keepalive, keepalive starts the process or daemon by calling its startup script. The process configuration file, which contains the process/daemon's characteristics, must be stored in the /etc/spawndaemon.d directory and its name must begin with the prefix ka_. The -r and -R options are mutually exclusive.
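A registration command line can be assembled programmatically. The Python sketch below builds an argument vector from the -F, -B, and -r options shown in the Synopsis and checks the ka_ prefix; the helper name and the file name ka_mydaemon are hypothetical, and whether -r expects a bare file name or a full path is not assumed here.

```python
import os


def registration_argv(process_cfg_file, favored=None, backups=()):
    """Build a spawndaemon argument list that registers one process/daemon."""
    if not os.path.basename(process_cfg_file).startswith("ka_"):
        raise ValueError("process configuration file name must begin with ka_")
    if backups and favored is None:
        raise ValueError("-B can only be used if -F is also used")
    argv = ["spawndaemon"]
    if favored is not None:
        argv += ["-F", str(favored)]
    for node in backups:
        argv += ["-B", str(node)]
    argv += ["-r", process_cfg_file]
    return argv
```

The resulting list can be passed to subprocess.run() on a system where spawndaemon is installed.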

-r group_cfg_file
Registers all the processes/daemons named in the group configuration file specified by group_cfg_file. As soon as each process/daemon is registered, keepalive starts the process/daemon by calling that process/daemon's startup script.

When spawndaemon attempts to register a group of processes/daemons and discovers that one or more processes/daemons within the group have been previously registered as non-grouped processes/daemons, it still registers all the processes/daemons specified in the group. If spawndaemon attempts to register a group of processes/daemons and discovers that an identically-named group of processes/daemons has already been registered, it does not register the new group, but exits with an error (return value of 1). To register two or more groups containing the same set of processes/daemons, specify different group names for the groups.

You must create the group configuration file and a process configuration file for each group member in the /etc/spawndaemon.d directory. For information about the required format of the configuration files, see Configuration Files.

-R process_cfg_file pid
Registers an already-running process/daemon with keepalive. You must provide the name of the process/daemon to be registered in the process configuration file specified by process_cfg_file. The configuration file must be stored in the /etc/spawndaemon.d directory. For information about configuration file naming and syntax requirements, and required access control permissions, see Configuration Files.

You must specify the process ID of the running process/daemon in pid. If the process/daemon with the specified PID is already registered, spawndaemon logs an error and no new registration takes place.

-s slot
Specifies the slot number (in the monitored process table) of a process/daemon to be unregistered. Process/daemon slot numbers can be read with the -v human processes and -v machine processes options.

-S slot
Identifies a process/daemon by monitored process table slot number to the -x option. Process/daemon slot numbers can be read with the -v human processes and -v machine processes options.

-U
Directs keepalive to spawn the process/daemon without pinning it to a particular node. Unpinned processes/daemons can be migrated from one node to another whenever they are sent a migration request, such as when the load_leveld(1) utility is active. If -U is not used, the process/daemon being registered is pinned to a node when spawned (see -Z option).

The default migration handler performs the migration; however, the default handler does not migrate a process/daemon's kernel objects (file descriptors and so on) to the new node. Those objects remain on the original node. If the original node fails, the migrated process/daemon can no longer access its kernel objects, which can lead to unpredictable behavior from the process/daemon. To preserve your process/daemon's kernel objects during migration, implement your own migration handler, and have that handler destroy and rebuild kernel objects (such as file descriptors) during migration.

To run multiple instances of the same process/daemon, you must be consistent in how you use the -U option. You cannot mix pinned and unpinned instances with the same name; keepalive requires such consistency in order to perform daemonization recovery on an instance that has failed. When the -o option is used to register an instance, this consistency restriction does not apply (processes registered with -o do not daemonize).

-v human [processes | keepalive]
When used with the -L option, displays the full contents of keepalive's monitored process table in a human-readable format. The following fields are displayed for each process/daemon registered with keepalive when the processes option is specified:

Monitored Process Table Process/Daemon Fields

Field                            Description
state                            The keepalive state for the process/daemon. See Process/Daemon States for a description of each state.
pid                              The current process identification number (PID) of the process/daemon.
node number                      The number of the node on which the process/daemon is running.
full path to process             The complete path to the process/daemon.
argument list                    The arguments (if any) used by keepalive to distinguish this process/daemon from others by the same name.
child of keepalive               Set to TRUE if the process is a child of keepalive; otherwise (such as when daemonization recovery is underway), set to FALSE.
daemonization recovery           Set to TRUE if the process has daemonized itself; otherwise, set to FALSE.
pinned                           Indicates whether the process/daemon has been designated to run on one or more specific nodes in the cluster. If set to TRUE, the process/daemon is pinned to one or more nodes. If set to FALSE, the process/daemon can float (migrate) among the nodes in the cluster.
lastexeced                       Indicates when the process/daemon was last started.
process first died               Indicates the first time the process/daemon stopped before being restarted.
process last died                Indicates the last time the process/daemon stopped before being restarted.
min. respawn                     Specifies the number of seconds the process/daemon must run before it is eligible for restarting.
num. errors                      The number of errors (such as process/daemon failures) that have occurred during the current probation period, which starts when the process/daemon fails and the error count is set to one (1). The error count includes process/daemon failures, node rejections by the process/daemon (see -c reject option), and node failures.
total errors                     Specifies the total number of errors since the process/daemon was first started. See num. errors for the events included in the error count.
max. errors during probation     Specifies the maximum number of errors (process/daemon failures) allowed before keepalive will no longer respawn the process/daemon (leaving the process/daemon in the down state). The maximum number of errors must occur during the specified probation period in order for the process/daemon to be left in the down state.
probation period                 Specifies the time, in seconds, during which the number of errors specified by max. errors during probation must occur in order for the process/daemon to be taken to the down state.
registration policy              One of the following methods by which the process/daemon is registered: Name, meaning keepalive looks for the process/daemon by name (the -a and -o options were not used to register the process/daemon); Argument List, meaning keepalive looks for the process/daemon by name and argument list (the -a option was used); PID, meaning keepalive looks for the process/daemon by process ID (the -o option was used).
node selection policy            Identifies the node selection policy as specified by the -Z option.
favored node                     The node on which the process/daemon is executed, as specified by the -F option. If no node is specified, this value is None.
backup nodes                     Specifies the nodes on which the process/daemon is executed if the favored node is unavailable. If no nodes are specified, this value is None.
rejected nodes                   A list of nodes that the process/daemon has rejected or the keepalive node selection policy has rejected.
termwait                         The time interval (in seconds) that keepalive gives the process/daemon to shut down before sending it a SIGKILL signal.
euid                             The user identification number of the process/daemon.
egid                             The group identification number of the process/daemon.
startup script                   The name of the script that keepalive executes when it starts the process/daemon.
shutdown script                  The name of the script that keepalive executes when it shuts down the process/daemon.
process failure recovery script  The name of the script that keepalive executes when it restarts the process/daemon after it fails.
node failure recovery script     The name of the script that keepalive executes when it restarts a process/daemon whose node has failed.
down script                      The name of the script that keepalive executes when a process/daemon enters the down state.
group                            The name of the registration group to which the process/daemon belongs. If the process/daemon does not belong to a group, None is displayed.
critical group process           Set to TRUE or FALSE to indicate whether or not the process/daemon is critical to its group. Set to N/A if the process/daemon is not a member of a group.
reject node exit code            Exit code specified by the -c reject option. Set to None if the process/daemon was not registered with the -c reject option.
down exit code                   Exit code specified by the -c down option. Set to None if the process/daemon was not registered with the -c down option.
exit status returned             If a process/daemon is not running, this is the exit code associated with the most recent failure/exit. If a process/daemon is running, None is reported.
last pid                         The PID of the last failed process/daemon in this slot.
slot                             The slot number of the process/daemon in this table.
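The relationship between num. errors, total errors, max. errors during probation, and probation period can be sketched in Python. This is an interpretation of the field descriptions, not keepalive's code: the class name is hypothetical, and it assumes that a probation period begins at the first failure and that the down state is entered once the count reaches the maximum.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorTracker:
    max_errors: int
    probation_period: float          # seconds
    num_errors: int = 0              # errors in the current probation period
    total_errors: int = 0            # errors since first start
    probation_start: Optional[float] = None

    def record_failure(self, now):
        """Record a failure at time `now` (in seconds).
        Return True if the process/daemon should go to the down state."""
        self.total_errors += 1
        expired = (self.probation_start is None
                   or now - self.probation_start > self.probation_period)
        if expired:
            # A new probation period starts with this failure.
            self.probation_start = now
            self.num_errors = 1
        else:
            self.num_errors += 1
        return self.num_errors >= self.max_errors
```

With max_errors set to 3 and probation_period set to 60, failures at 0, 10, and 20 seconds lead to the down state, whereas failures at 0, 10, and 100 seconds do not, because the third failure starts a new probation period.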

The following fields are displayed when the keepalive option is specified:

Monitored Process Table Keepalive Fields

Field                    Description
running                  Set to TRUE if keepalive is running; otherwise, set to FALSE.
quiesce flag             Set to TRUE if keepalive is currently quiesced with the -q option; otherwise, set to FALSE.
pid                      The process identification number for keepalive.
node number              The number of the node on which keepalive is running.
registered processes     The total number of processes/daemons currently registered with keepalive, which equates to the number of entries in the monitored process table.
table size               The current number of slots in the monitored process table. Each registered process/daemon uses one slot in the table. "table size" is less than or equal to "max. possible processes"; if less than "max. possible processes," it is because keepalive has not allocated memory for the maximum number of slots.
max. possible processes  The maximum number of processes/daemons that can be registered, which equates to the maximum size of the monitored process table. This is a minimum of 200 and can be increased with the -z option.
polling                  Set to TRUE or FALSE, indicating whether keepalive detects a process/daemon failure by polling rather than via child process adoption (that is, on receipt of the SIGCHLD signal).
polling interval         The time in seconds that keepalive uses as a polling interval. This can be controlled with the -t option of keepalive(1M).
primary node             The primary node on which the keepalive process is executed. This can be controlled with the -P option of keepalive(1M).
secondary node           The secondary node on which the keepalive process is executed when the primary node is unavailable. This can be controlled with the -P option of keepalive(1M).

-v machine [processes | keepalive]
When used with the -L option, displays the full contents of keepalive's monitored process table in a format that can be parsed by shell utilities or by a C program. The record for each process/daemon is listed on a single line and is terminated with a newline character. Each field in the record appears in the format FIELD="VALUE" or FIELD=VALUE. Each field/value pair is terminated by a semicolon (;). VALUE is enclosed within double quotes when keepalive provides string values (so that a semicolon appearing inside a string value is not mistaken for the delimiter). In fields that contain lists of values, a comma separates each VALUE or "VALUE". The following field/value pairs are returned for each process/daemon when the processes option is specified:

Field No.  Field="Value" or Field=Value
1 state="<current state of the process/daemon>";
2 pid="<pid of process/daemon>" or "None";
3 node_number="<number of node running process/daemon>" or "None";
4 full_path_to_process="<full pathname for process/daemon>";
5 arg_list="<list of arguments>" or "" (null string);
6 child_of_keepalive=TRUE or FALSE;
7 daemonization_recovery=TRUE or FALSE;
8 pinned=TRUE or FALSE;
9 lastexeced="<time process/daemon was last exec'ed>";
10 process_first_died="<time process/daemon first died>" or "Never";
11 process_last_died="<time process/daemon last died>" or "Never";
12 minrespawn=<minimum time allowed between respawns>;
13 num_errors=<number of process/daemon failures within current probation_period>;
14 total_errors=<total number of failures for this process/daemon>;
15 max_errors_during_probation=<maximum number of failures allowed within probation_period>;
16 probation_period=<probation period in seconds>;
17 registration_policy="<method by which process/daemon is registered>";
18 node_selection_policy="<node selection policy>";
19 favored_node="<node designated with -F option>" or "None";
20 backup_nodes="<node designated with -B option>, <node designated with -B option>, ..." or "None";
21 rejected_nodes="<rejected node>, <rejected node>, ..." or "None";
22 termwait=<time allowed (in seconds) to shut down before sending SIGKILL>;
23 euid=<user ID of process/daemon>;
24 egid=<group ID of process/daemon>;
25 startup_script="<pathname of startup script>";
26 shutdown_script="<pathname of shutdown script>" or "None";
27 process_failure_recovery_script="<pathname of process failure recovery script>" or "None";
28 node_failure_recovery_script="<pathname of node failure recovery script>" or "None";
29 down_script="<pathname of down script>" or "None";
30 group="<group name>" or "None";
31 critical_group_process="TRUE" or "FALSE" (if a group member) or "N/A";
32 reject_exit_code="<exit code used to reject host node>" or "None";
33 down_exit_code="<exit code used to take process/daemon down>" or "None";
34 exit_status_returned="<exit code if process/daemon not running>" or "None";
35 last_pid="<pid of last failed process/daemon>" or "None";
36 slot=<monitored process table slot number of this process/daemon>;

Refer to the -v human option for a description of the fields for each process/daemon. The following field/value pairs are returned when the keepalive option is specified:

Field No.  Field="Value" or Field=Value
1 running=TRUE or FALSE;
2 quiesce_flag=TRUE if keepalive quiesced, FALSE otherwise;
3 pid="<pid of keepalive>" or "None";
4 node_number="<number of node running keepalive>" or "None";
5 registered_processes=<current number of registered processes>;
6 table_size=<current size of monitored process table>;
7 max_possible_processes=<maximum number of processes/daemons that can be registered (maximum size of monitored process table)>;
8 polling=TRUE if keepalive detects process/daemon failure via polling, FALSE if SIGCHLD is used to detect failure;
9 polling_interval=<keepalive polling interval>;
10 primary_node="<primary node on which keepalive is executed>" or "None";
11 secondary_node="<secondary node on which keepalive is executed>" or "None";

Refer to the -v human option for a description of the fields for keepalive.
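
The record format described above (FIELD="VALUE"; or FIELD=VALUE; pairs on one line) can be read with a small parser. The following Python sketch is illustrative only: the sample line is invented, not captured from keepalive, and the sketch assumes that quoted values never contain an embedded double quote, which this page does not specify.

```python
import re

# One FIELD="VALUE"; or FIELD=VALUE; pair.
# A quoted value may contain semicolons; a bare value may not.
_PAIR = re.compile(r'\s*(\w+)=(?:"([^"]*)"|([^;]*));')


def parse_record(line):
    """Parse one record line into a dict mapping field name to string value."""
    fields = {}
    for match in _PAIR.finditer(line):
        name, quoted, bare = match.groups()
        fields[name] = quoted if quoted is not None else bare
    return fields


def split_list(value):
    """Split a comma-separated list field such as backup_nodes."""
    return [item.strip() for item in value.split(",")]


# Hypothetical sample record (shortened; a real record has 36 fields).
sample = 'state="ok";pid="1234";arg_list="-f a;b";pinned=TRUE;backup_nodes="2, 3";slot=0;'
record = parse_record(sample)
print(record["state"], record["pinned"], split_list(record["backup_nodes"]))
```

Values are kept as strings, so a caller still has to treat "None" and "Never" as placeholders and convert numeric fields itself.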

-x
If a process/daemon is down for any reason (for example, because it has reached its maximum allowed error count within the specified probation period), its error (failure) count is cleared and the process/daemon is restarted. For information about max_errors and probation_period, see the Configuration Files section of this reference manual page.

If a process/daemon is running, the -x option clears the error count so it appears the process/daemon has not failed. If -k is used with -x, the process/daemon is shut down and restarted, which can be used to restart a process/daemon that appears hung.

-X
Resumes normal keepalive operation. Call spawndaemon with this option after you quiesce keepalive with the -q option.

-z max_processes
Increases the size of the monitored process table (the maximum number of processes/daemons that keepalive can monitor) to the value specified in max_processes. By default, keepalive can monitor 200 processes/daemons; this option can only be used to increase the maximum size of the table.

-Z F_node | round_robin | last_node
Specifies a node selection policy for the process/daemon. A node selection policy only has meaning if the spawned process/daemon is to be pinned to a node; therefore, do not use this option if the spawned process/daemon is not to be pinned (see -U option). If neither the -Z nor the -U option is used when registering an individual process/daemon, the node selection policy defaults to round_robin. For group registrations, the node selection policy defaults to last_node.

If you specify F_node, the node specified with the -F option is used first for the restart attempt. If this favored node is not available, the node(s) designated with the -B option are tried next. If none of the designated nodes are available, the restart attempt is delayed until one of the nodes becomes available. Use of F_node is not valid unless the -F option is also used.

If you specify round_robin without using the -F and -B options, keepalive picks the first available node in the cluster on which to restart the process/daemon, unless the last restart attempt on that node failed, in which case keepalive picks the next available node. keepalive maintains a list of visited nodes; when all nodes in the list have been visited, the list is cleared and reused until the process/daemon reaches max_errors within the probation_period. Specify round_robin if you do not care which node runs your process/daemon. Note, however, that the -F and -B options can be used with round_robin to designate the nodes used in the restart attempts. The round_robin node selection policy offers the highest level of availability since the node on which the process/daemon most recently failed/exited is avoided.

If you specify last_node, keepalive first tries to restart the process on the last node on which it was running before trying to restart it on any other node. If the restart attempt fails, the nodes designated with the -F and -B options are used on subsequent restart attempts in the order specified on the spawndaemon command line. If the -F option is used without the -B option, the restart attempts are repeated on the -F node as long as it is available. If no -F (or -B) option is used, the restart is attempted on an available node in a round robin fashion. The last_node restart policy is useful in those situations where resources (such as a shared memory segment) may persist across process/daemon failures.
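
As a rough illustration of the F_node policy only, the Python sketch below tries the favored node first, then the backup nodes in the order given, and reports that the restart must wait when none is available. The function name and the representation of nodes as integers are assumptions made for this example; the sketch ignores rejected nodes and is not keepalive's implementation.

```python
def pick_f_node(favored, backups, available):
    """Return the node to try under the F_node policy, or None to delay.

    favored   -- node given with -F
    backups   -- nodes given with -B, in command-line order
    available -- set of nodes currently available
    """
    for node in [favored, *backups]:
        if node in available:
            return node
    return None  # no designated node is up: the restart attempt is delayed


print(pick_f_node(1, [2, 3], {2, 3}))
print(pick_f_node(1, [2, 3], set()))
```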

Return Values

The spawndaemon command returns the following values:
  • 0 - The operation has completed successfully.
  • 1 - A fatal error has occurred.
  • 2 - An idempotency violation has occurred.

Files

/dev/keepalivecfg
Named pipe that keepalive uses for receiving commands from the spawndaemon utility.

/etc/keepalive.d
Directory where you store the scripts for managing monitored processes/daemons. For each monitored process/daemon, you must provide a startup script.

Optionally, you can provide additional scripts for keepalive to call when other events occur, such as process/daemon failure, node failure, or when the process/daemon goes to the keepalive down state (meaning, for example, that the monitored process/daemon has produced more errors than allowed within its specified probation period and, therefore, is not restarted by keepalive). For more information about optional scripts, see the Configuration Files section later in this reference manual page.

The user and group ID must be root for all scripts. In addition, access control permission for each script must be set to rwx r-x r-x (755).

/etc/keepalive.d/keepalive.data
keepalive stores the state of all monitored processes and daemons in this memory-mapped data file, which is referred to as the monitored process table. If you remove /etc/keepalive.d/keepalive.data, the keepalive daemon should be shut down with the spawndaemon -Q option. When no longer able to communicate with keepalive, spawndaemon displays a message indicating that it cannot find keepalive.

/etc/rc*.d
Set of directories where you register processes/daemons with keepalive and store the start and stop scripts for system processes/daemons.

/etc/spawndaemon.d
Directory where you store configuration files for each monitored process/daemon. The name of all process and group configuration files must begin with the prefix ka_. The following section describes configuration files in detail.

Configuration Files

All configuration files must reside in the /etc/spawndaemon.d directory and their names must begin with the prefix ka_. The user and group ID must be root for all configuration files. In addition, access control permission for each configuration file must be set to rw- r-- r-- (644).
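
The naming, ownership, and permission rules above can be checked before registration. The following Python sketch is one possible pre-flight check, not part of spawndaemon; the function names are hypothetical, and it assumes that root corresponds to numeric user ID 0 and group ID 0.

```python
import os
import stat


def config_file_problems(name, mode, uid, gid):
    """Return a list of rule violations for one configuration file."""
    problems = []
    if not name.startswith("ka_"):
        problems.append("name must begin with the prefix ka_")
    if uid != 0 or gid != 0:  # assumption: root is uid 0 and gid 0
        problems.append("user and group ID must be root")
    if stat.S_IMODE(mode) != 0o644:
        problems.append("permissions must be rw- r-- r-- (644)")
    return problems


def check_path(path):
    """Apply the checks to a file on disk."""
    st = os.stat(path)
    return config_file_problems(
        os.path.basename(path), st.st_mode, st.st_uid, st.st_gid
    )
```

For example, check_path("/etc/spawndaemon.d/ka_cron") would return an empty list when the file satisfies all three rules.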

The spawndaemon command uses different configuration file formats depending on whether an individual process/daemon or a group of processes/daemons is to be registered. However, all members of a group must have an individual process configuration file as well as being identified as a group member in a group configuration file.

Process Configuration Files

Each process configuration file for registering a single process/daemon must be formatted as follows:

[group_name]:full_path_to_executable:[arg_list]:[termwait]:uid:gid:[max_errors]:
[probation_period]:[minrespawn]:startup_script:[shutdown_script]:
[process_failure_recovery_script]:[node_failure_recovery_script]:[down_script]:[down_script_policy]

Each field must be separated by a colon; all fields must be listed on the same line.

group_name specifies the name of the keepalive group. This field is required for a process/daemon that is a member of a group and must be left blank if the process/daemon is not a member of a group. The group_name specified here must be identical to the group's group_name field in the group configuration file. The value you specify for group_name cannot exceed 16 characters in length.

full_path_to_executable specifies the full pathname of the process/daemon to be monitored.

arg_list is supplied when the -a option is used for registration. arg_list is all (or a subset of) the arguments used to distinguish this instance of the process/daemon from others of the same name. arg_list specifies all (or some) of the arguments used on the process/daemon command line(s) in the startup and recovery (restart) scripts, in the same order (with none missing in the sequence provided) as they appear on the command line(s), starting at the beginning of the argument list.

termwait is the time interval (in seconds) that keepalive gives the process/daemon to shut down before sending it a SIGKILL signal. The time interval defaults to two (2) seconds.

uid and gid specify the name of the user ID and group ID, respectively, which keepalive uses to spawn the registered process/daemon. keepalive calls setuid(2) with uid and setgid(2) with gid, respectively, when it forks the monitored process/daemon.

max_errors specifies the maximum number of errors allowed for a process/daemon before the keepalive daemon stops respawning it (thereby leaving the process/daemon in a down state). The errors must occur within the number of seconds specified by the probation_period field. Specifying zero (0) for either max_errors or probation_period causes keepalive to restart the process/daemon an unlimited number of times. The default value for max_errors is 10 errors. The default value for probation_period is 300 seconds. A registered process/daemon's first error starts its probation-period timer. If fewer than max_errors errors occur within the probation period, the period expires; on the next failure, keepalive resets the process/daemon's error count to one (1) and restarts the probation-period timer.
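
The interaction between max_errors and probation_period can be sketched as a small counter. The Python class below is a simplified model, not keepalive's code: the class and method names are hypothetical, and it assumes the process/daemon goes down when the count reaches max_errors (other passages on this page say "exceeded", so the exact boundary is an assumption).

```python
class ErrorTracker:
    """Simplified model of the per-process error count and probation timer."""

    def __init__(self, max_errors=10, probation_period=300):
        self.max_errors = max_errors
        self.probation_period = probation_period
        self.num_errors = 0
        self.period_start = None  # time of the error that started the period

    def record_failure(self, now):
        """Record a failure at time `now` (seconds); return True if down."""
        if self.max_errors == 0 or self.probation_period == 0:
            return False  # zero in either field: restart without limit
        expired = (
            self.period_start is None
            or now - self.period_start >= self.probation_period
        )
        if expired:
            # First error, or first error after an expired period:
            # the count restarts at one and the timer starts again.
            self.period_start = now
            self.num_errors = 1
        else:
            self.num_errors += 1
        return self.num_errors >= self.max_errors  # assumed boundary


tracker = ErrorTracker(max_errors=3, probation_period=10)
results = [tracker.record_failure(t) for t in (0, 5, 20, 21, 22)]
print(results)
```

In this run the failure at 20 seconds falls outside the first 10-second period, so the count restarts at one, and only the third failure of the second period takes the process/daemon down.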

minrespawn is the number of seconds that is considered the minimum respawn time. The minrespawn timer starts when the process/daemon is spawned. If the process/daemon exits before the minimum respawn time has elapsed, keepalive does not restart the process/daemon until that time elapses. The default for minrespawn is zero (0) seconds.
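
The minrespawn rule amounts to a simple delay calculation. In the Python sketch below, the function name and the use of plain second counts for timestamps are assumptions made for illustration.

```python
def respawn_delay(spawned_at, exited_at, minrespawn=0):
    """Seconds to wait before restarting, measured from the exit time."""
    elapsed = exited_at - spawned_at
    return max(0, minrespawn - elapsed)


print(respawn_delay(100, 103, minrespawn=10))
print(respawn_delay(100, 130, minrespawn=10))
```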

startup_script is the name of the script that keepalive runs to start the process/daemon. However, keepalive also executes this script in the following situations: the process/daemon fails and no process_failure_recovery_script has been specified in the configuration file; a node failure occurs and neither a process_failure_recovery_script nor a node_failure_recovery_script has been specified in the configuration file. startup_script is the only required script. The script must reside in the /etc/keepalive.d directory.

shutdown_script is the name of an optional script that keepalive executes when the administrator terminates the process/daemon by running spawndaemon with the -k option. The shutdown script must perform all steps necessary for a controlled shutdown. To support the shutdown script, keepalive sets $KEEPALIVE_ACTIVE_PID to the PID of the process/daemon for which the shutdown has been requested. If no shutdown script is specified, keepalive issues a SIGTERM signal, which the process/daemon must handle. If the process/daemon still exists after the termwait interval, keepalive sends the process/daemon a SIGKILL signal. The shutdown script must reside in the /etc/keepalive.d directory.

process_failure_recovery_script is the name of an optional script that keepalive executes if the process/daemon fails. keepalive also runs this script if the host node for the process/daemon fails, but no node_failure_recovery_script has been specified in the process configuration file. The process_failure_recovery_script must reside in the /etc/keepalive.d directory.

node_failure_recovery_script is the name of an optional script that keepalive executes in the event the host node for the process/daemon fails. The node failure recovery script must reside in the /etc/keepalive.d directory.
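
The fallback order among the startup and recovery scripts described above can be summarized in a few lines. This Python sketch only restates the documented order; the function name and the event labels "process_failure" and "node_failure" are hypothetical.

```python
def script_for_event(event, startup, process_failure=None, node_failure=None):
    """Return the script name keepalive would run for a failure event."""
    if event == "process_failure":
        # Falls back to the startup script when no recovery script is set.
        return process_failure or startup
    if event == "node_failure":
        # Node recovery script first, then process recovery, then startup.
        return node_failure or process_failure or startup
    raise ValueError("unknown event: " + event)


# Script names taken from the cron example on this page.
print(script_for_event("node_failure", "cron_startup", "cron_restart"))
```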

down_script is the name of the script that keepalive runs if the monitored process/daemon goes to the keepalive down state. Processes/daemons enter the down state when they fail more times than allowed within their specified probation period or they take themselves down with a down exit code (see -c down option). A down script should return the failed process/daemon's resources and perform any other cleanup-related tasks. keepalive attempts to run the down script on the same node on which the failed process/daemon was last running. However, if that node is down, keepalive tries the other nodes in turn until it exhausts the list of available nodes (logging each failure in the system log). To support the down script, keepalive sets $KEEPALIVE_LAST_PID to keepalive's last known PID for the process/daemon before it went to the down state. The down script must reside in the /etc/keepalive.d directory.

down_script_policy specifies whether keepalive should run the down script if the node on which the monitored process/daemon is running fails (thereby causing the process/daemon to go to the keepalive down state). A value of zero (0) for down_script_policy means that keepalive does not run the down script. A value of one (1), the default, indicates that keepalive runs the down script.

See Examples for an example of a process configuration file.

Group Configuration Files

A group configuration file must be formatted as follows:

<keepalive_group>:group_name
member_file:[wait_time]:[critical]
member_file:[wait_time]:[critical]
...
where <keepalive_group> and member_file each start a new line.

The first item in the first line of the group configuration file must be the string <keepalive_group> (the < and > symbols are required).

The group_name, which is limited to no more than 16 characters in length, specifies the name you want to assign the process/daemon group. The string you specify for group_name must match the group_name string found in the process configuration files for each of the group members.

Each line following the first line applies to a process/daemon that belongs to the group.

member_file specifies the name of a process configuration file for the group member. If you specify a group configuration file as a member_file, spawndaemon aborts the registration of the group.

If specified, wait_time is the number of seconds that keepalive delays before starting the next process/daemon in the group. When specified, wait_time must be a non-negative integer. If you do not specify wait_time, spawndaemon accepts the default value of 0 (zero seconds), meaning that there is to be no delay. Specifying a delay for the last process/daemon in the list has no effect.

critical is the criticality of the process/daemon within the keepalive group. Valid criticalities are zero (0) and one (1). A criticality of 0 (the default) indicates that keepalive should respawn only this process/daemon if it terminates. A value of 1 indicates that keepalive should respawn the entire group if this process/daemon terminates.
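
To make the group file layout concrete, the following Python sketch parses a group configuration and applies the documented defaults for wait_time and critical. The parser, its function name, and the sample group (webgroup, ka_httpd, ka_logger) are invented for illustration and do not come from spawndaemon.

```python
def parse_group_config(text):
    """Parse group configuration text into (group_name, members)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty group configuration")
    marker, _, group_name = lines[0].partition(":")
    if marker != "<keepalive_group>":
        raise ValueError("first line must begin with <keepalive_group>")
    if not group_name or len(group_name) > 16:
        raise ValueError("group_name must be 1 to 16 characters")
    members = []
    for ln in lines[1:]:
        parts = ln.split(":")
        parts += [""] * (3 - len(parts))  # allow omitted trailing fields
        member_file, wait_time, critical = parts[:3]
        members.append({
            "member_file": member_file,
            "wait_time": int(wait_time) if wait_time else 0,  # default 0
            "critical": int(critical) if critical else 0,     # default 0
        })
    return group_name, members


# Hypothetical group with one critical member and a 5-second start delay.
sample = """<keepalive_group>:webgroup
ka_httpd:5:1
ka_logger::
"""
name, members = parse_group_config(sample)
print(name, members)
```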

Examples

The content of the process configuration file for cron(1M) found at /etc/spawndaemon.d/ka_cron is as follows:

:/usr/sbin/cron:::root:sys::::cron_startup::cron_restart:::

This line indicates that cron:

  • Is not part of a group,
  • Has a full pathname of /usr/sbin/cron,
  • Uses root for the uid and sys for the gid,
  • Has both a startup_script and a process_failure_recovery_script,
  • Has no argument list, shutdown_script, node_failure_recovery_script, or down_script,
  • Takes the default values for termwait, max_errors, probation_period, and minrespawn.

For examples of startup scripts and other types of scripts, refer to the files located in /etc/keepalive.d on your cluster.
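
The cron entry above can also be split into its 15 named fields programmatically. The Python sketch below is an illustration rather than spawndaemon's own parser; it fills in the defaults documented in Process Configuration Files and assumes that no field (for example, arg_list) contains a colon, since this page does not describe any escaping.

```python
FIELDS = [
    "group_name", "full_path_to_executable", "arg_list", "termwait",
    "uid", "gid", "max_errors", "probation_period", "minrespawn",
    "startup_script", "shutdown_script", "process_failure_recovery_script",
    "node_failure_recovery_script", "down_script", "down_script_policy",
]

# Defaults stated in the Process Configuration Files section.
DEFAULTS = {
    "termwait": "2",
    "max_errors": "10",
    "probation_period": "300",
    "minrespawn": "0",
    "down_script_policy": "1",
}


def parse_process_config(line):
    """Split one process configuration line into a dict of named fields."""
    values = line.rstrip("\n").split(":")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    record = dict(zip(FIELDS, values))
    for name, default in DEFAULTS.items():
        if not record[name]:
            record[name] = default
    return record


cron = parse_process_config(
    ":/usr/sbin/cron:::root:sys::::cron_startup::cron_restart:::"
)
print(cron["startup_script"], cron["process_failure_recovery_script"])
```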

Process/Daemon States

Processes/daemons monitored by keepalive exist in one of the following states at all times:
start
The process/daemon is being started and goes to the ok state if the start is successful. If the start fails on all available members of the node set, the process/daemon goes to the dead state. If no members of the node set are available, the process/daemon remains in the start state.
ok
The process/daemon is running. A process/daemon in this state goes to one of the following states: daemonize, if it has failed and was registered without the -o option; dead, if it has failed and was registered with the -o option, or if it has rejected the node; shutdown, if spawndaemon is used to shut down the process/daemon, or if it is a member of a group in which a critical member has failed; down, if it is a member of a group that includes a process/daemon that has gone to down, or if it exits with a down exit code and the down feature (see -c down option) has been enabled.
dead
The process/daemon has failed and is not running. It goes to the respawn state if its max_errors have not been exceeded within the probation_period; if max_errors have been exceeded, it goes to down.
down
The process/daemon is not running. It has exceeded its max_errors within the probation_period or is part of a group that includes a member that has gone down for exceeding its max_errors. Use the -x option to move the process/daemon to the respawn state.
respawn
The process/daemon is being restarted. It goes to the ok state if the restart is successful or to dead if not successful on any of the available nodes in the node set. If no members of the node set are available, the process/daemon remains in the respawn state.
shutdown
The process/daemon is not running and has been shut down from the ok state (see -k option). The process/daemon either goes to the respawn state (if -x is used with -k) or is unregistered (see -d, -p, -s, and -Q options).
daemonize
The process/daemon has failed and may have daemonized itself, so keepalive runs the daemonization recovery algorithm. If daemonization recovery succeeds, the process/daemon goes to the ok state; if it fails, the process/daemon goes to the dead state.
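
The state descriptions above can be condensed into a transition table. The Python sketch below is a reading aid, not keepalive's implementation: the event names are hypothetical labels for the conditions described in the text, unregistration from the shutdown state is not modeled, and any pair that is not listed leaves the state unchanged (for example, a process/daemon stays in start while no node is available).

```python
TRANSITIONS = {
    ("start", "start_succeeded"): "ok",
    ("start", "start_failed_on_all_nodes"): "dead",
    ("ok", "failed_registered_without_o"): "daemonize",
    ("ok", "failed_registered_with_o"): "dead",
    ("ok", "rejected_node"): "dead",
    ("ok", "shutdown_requested"): "shutdown",
    ("ok", "critical_group_member_failed"): "shutdown",
    ("ok", "group_member_went_down"): "down",
    ("ok", "down_exit_code"): "down",
    ("dead", "errors_within_limit"): "respawn",
    ("dead", "max_errors_exceeded"): "down",
    ("down", "x_option"): "respawn",
    ("respawn", "restart_succeeded"): "ok",
    ("respawn", "restart_failed_on_all_nodes"): "dead",
    ("shutdown", "x_option"): "respawn",
    ("daemonize", "recovery_succeeded"): "ok",
    ("daemonize", "recovery_failed"): "dead",
}


def next_state(state, event):
    """Return the state that follows `state` when `event` occurs."""
    return TRANSITIONS.get((state, event), state)


# A failure without -o, failed daemonization recovery, then a respawn.
state = "ok"
for event in ("failed_registered_without_o", "recovery_failed",
              "errors_within_limit", "restart_succeeded"):
    state = next_state(state, event)
print(state)
```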

References

load_leveld(1), migrate(1), node_self(1), onnode(1), cluster(1M), init(1M), keepalive(1M), syslogd(1M), setuid(2), setgid(2), signal(3bsd), syslog(3G)
15 August 2001
Copyright 2001 Compaq Computer Corporation
Cluster-Tools Version 0.5.8

This file last updated on Tuesday, 14-May-2002 09:35:07 UTC