spawndaemon(1M)
spawndaemon -- user-level interface to the keepalive daemon
Synopsis
spawndaemon [ [-i cluster_wide] | [-i node_list] ]
[-n times] [ [-Z last_node] | [-Z round_robin] ]
[-o] [-a]
[-c down=exit_code] [-c reject=exit_code]
-r process_cfg_file
spawndaemon [ [-i cluster_wide] | [-i node_list] ]
[ [-Z last_node] | [-Z round_robin] ]
[-o] [-a]
-R process_cfg_file pid
spawndaemon [ [-i cluster_wide] | [-i node_list] ]
[ [-Z last_node] | [-Z round_robin] ]
[-o] -r group_cfg_file
spawndaemon [ [-i cluster_wide] | [-i node_list] ]
[-n times]
-F node [ [-B node] . . . ]
[ [-Z F_node] | [-Z last_node] | [-Z round_robin] ]
[-o] [-a]
[-c down=exit_code] [-c reject=exit_code]
-r process_cfg_file
spawndaemon [ [-i cluster_wide] | [-i node_list] ]
-F node [ [-B node] . . . ]
[ [-Z F_node] | [-Z last_node] | [-Z round_robin] ]
[-o] [-a]
-R process_cfg_file pid
spawndaemon [ [-i cluster_wide] | [-i node_list] ]
-F node [ [-B node] . . . ]
[ [-Z F_node] | [-Z last_node] | [-Z round_robin] ]
[-o] -r group_cfg_file
spawndaemon [-i cluster_wide] [-n times]
-U [-o] [-a]
[-c down=exit_code]
-r process_cfg_file
spawndaemon [-i cluster_wide]
-U [-o] [-a]
-R process_cfg_file pid
spawndaemon [-i cluster_wide]
-U [-o]
-r group_cfg_file
spawndaemon [-k] -x -D full_pathname
spawndaemon [-k] -x -P pid
spawndaemon [-k] -x -S slot
spawndaemon [-k] [-a] -d full_pathname [arg_list]
spawndaemon [-k] -p pid
spawndaemon [-k] -g group_name
spawndaemon [-k] -s slot
spawndaemon -q
spawndaemon [-k] -Q
spawndaemon -L [-v human [processes | keepalive] ]
spawndaemon -L [-v machine [processes | keepalive] ]
spawndaemon -X
spawndaemon -z max_processes
Description
The spawndaemon command provides the command-line interface to the
keepalive(1M)
daemon, which monitors processes and daemons, and
restarts those processes/daemons when they die. keepalive can
monitor processes and daemons individually and in groups.
Specifically, spawndaemon performs the following tasks:
- Processes the spawndaemon command-line options and the
associated configuration files.
- Sends messages to the keepalive daemon, which then performs
all process/daemon startup and monitoring tasks.
- Reads information from keepalive's monitored process table and
displays that information for the system administrator.
To configure a process or daemon to be monitored, perform the steps
described in the following paragraphs. For more information about
particular files and directories, see the Files
section later in this reference manual page.
- Create a process configuration file for the process or daemon
and store it in
the /etc/spawndaemon.d directory. For information about
configuration file syntax and about ownership and permission requirements, see the
Configuration Files section
later in this reference manual page.
- Create a script in one of the /etc/rc*.d directories, and
in that script call spawndaemon to register the process/daemon
with keepalive. Select the appropriate spawndaemon
syntax from the preceding Synopsis
section; the spawndaemon command references the configuration
file created in
the previous step. For information about spawndaemon options,
see the Options
section later in this reference manual page.
- Create a startup script in the /etc/keepalive.d directory. The
keepalive daemon calls this startup script to start the process
or daemon and (if so configured) to restart the process/daemon if it fails.
- Create other (optional) scripts to be used by keepalive
for handling your process/daemon. Examples of optional scripts include:
restart on process/daemon failure, restart on node failure, clean up when
the number of process/daemon failures exceeds the configured limit, and
shut down when directed manually by the spawndaemon command.
These scripts must also reside in the /etc/keepalive.d directory.
- Start the process/daemon, either by calling spawndaemon from
the command line of your system, or by restarting your system and
allowing the various scripts in the /etc/rc*.d directories and
the /etc/keepalive.d directory to start both keepalive and
your processes/daemons.
- Verify that your process/daemon is registered correctly by using the
spawndaemon -L and -v options to read the registration
information maintained by keepalive.
To configure a process/daemon group for monitoring, each group member must
be registered and have a process configuration file and startup/recovery
scripts similar to an individual process/daemon. In addition, a group
configuration file is required as described under
Configuration Files.
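The registration and verification steps above can be sketched as a pair of small helpers that only assemble the command lines shown in the Synopsis; they do not run anything. This is a minimal sketch: the helper names and the configuration file name ka_mydaemon are hypothetical, and whether spawndaemon expects a bare file name or a full path is not stated here.

```python
def register_argv(process_cfg_file):
    """Command line that registers (and starts) one process/daemon."""
    # -r registers the process/daemon named in the configuration file;
    # keepalive then starts it by calling its startup script.
    return ["spawndaemon", "-r", process_cfg_file]


def verify_argv(fmt="human", table="processes"):
    """Command line that reads back the registration information."""
    # The Synopsis allows -v human or -v machine, each followed by
    # either "processes" or "keepalive".
    if fmt not in ("human", "machine"):
        raise ValueError("fmt must be 'human' or 'machine'")
    if table not in ("processes", "keepalive"):
        raise ValueError("table must be 'processes' or 'keepalive'")
    return ["spawndaemon", "-L", "-v", fmt, table]


# "ka_mydaemon" is a hypothetical configuration file name.
print(" ".join(register_argv("ka_mydaemon")))
print(" ".join(verify_argv()))
```

Keeping the command lines as argument lists rather than strings avoids quoting problems if they are later passed to a process-spawning call.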
Options
The spawndaemon command uses the following options:
- -a
- When used for process/daemon registration, the -a option
specifies that a process/daemon be
registered with keepalive by name and argument list. Including
an argument list in the registration provides a method to distinguish
between processes/daemons having the same name.
The argument list is contained in the process configuration file
(see arg_list under
Configuration Files)
designated with the -r or -R options.
The -a option is intended to be used with processes that daemonize
themselves. The -a option cannot be used with group registrations.
When -a is specified, keepalive searches the
system process table using both the process name and arg_list
to find and register the process ID (PID) of the process after it has
daemonized. When -a is not used, only the process name is used in
the search of the process table to find the PID following daemonization.
arg_list does not have to contain all of the arguments
used by the
process/daemon command line(s) for startup and recovery (restart). However,
arg_list must contain one or more of the arguments in
the same order (with none missing in the sequence provided)
as they appear on the command line(s),
starting at the beginning of the argument list.
When the -a option is used to unregister a process/daemon
(see -d option) that has
been registered with an argument list, the arg_list
must be included with the -d option in the
spawndaemon command line.
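The ordering rule for arg_list amounts to a prefix test: the registered arguments must be a leading, gap-free run of the arguments on the command line. The following sketch illustrates that rule only; the function name is hypothetical and this is not keepalive's own matching code.

```python
def arg_list_matches(arg_list, command_args):
    """Return True if arg_list is a non-empty leading run of command_args.

    arg_list     -- arguments given in the process configuration file
    command_args -- arguments on the startup or recovery command line
    """
    arg_list = list(arg_list)
    command_args = list(command_args)
    if not arg_list:
        # At least one argument is required.
        return False
    # Same order, none missing, starting at the first argument.
    return command_args[:len(arg_list)] == arg_list
```

For a command line with the arguments -c, /etc/x.conf, -v, the lists [-c] and [-c, /etc/x.conf] qualify, while [-v] (does not start at the beginning) and [-c, -v] (skips an argument) do not.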
- -B node
-
A single use of the -B option specifies the number of the (backup) node
on which the process/daemon is to be executed if the node specified with the
-F option is unavailable. For more information on the nodes in your
cluster, see cluster(1M).
The -B option can only be used if the -F option is also
used.
The -B option can be used more than once. If the node specified by
the first use of the -B option is unavailable, keepalive
uses the node specified by the second use of the
-B option, and so on. The -B option can be used up to 12
times. Refer to the -Z option for the available restart policies
involving the nodes specified with the -F and -B options.
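The constraints on -F and -B (a backup requires a favored node, at most 12 backups, backups tried in the order given) can be checked before spawndaemon is called. The sketch below uses hypothetical helper names; first_available models only the simple fallback order described for -B and ignores the restart policies selected with -Z.

```python
MAX_BACKUP_NODES = 12  # the -B option can be used up to 12 times


def node_options(favored=None, backups=()):
    """Build the -F/-B part of a spawndaemon command line."""
    backups = list(backups)
    if backups and favored is None:
        raise ValueError("-B can only be used if -F is also used")
    if len(backups) > MAX_BACKUP_NODES:
        raise ValueError("-B can be used at most 12 times")
    argv = []
    if favored is not None:
        argv += ["-F", str(favored)]
    for node in backups:
        argv += ["-B", str(node)]
    return argv


def first_available(favored, backups, available):
    """Return the favored node if available, else the first available backup."""
    for node in [favored, *backups]:
        if node in available:
            return node
    return None  # no node in the allowed set is available
```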
- -c down=exit_code
-
This option tells keepalive to enable the down feature
for the process/daemon being registered. The down feature
allows a process/daemon to instruct keepalive to take that
process/daemon to the keepalive down state.
keepalive will not restart any process/daemon in the down
state. In order to restart such a process/daemon, the spawndaemon
command with the -x option must be used.
When a process/daemon exits to the down state, it must return the exit code you
specify in exit_code. exit_code must be an
integer other than zero (0). If exit_code is 0 (or if the
-c reject option has already been specified with
the same exit_code),
registration fails and the process/daemon is not started. keepalive
communicates the down exit code to the process/daemon through the
KEEPALIVE_PROCESS_DOWN environment variable, the value of which
is the exit_code you specify in the call to spawndaemon.
KEEPALIVE_PROCESS_DOWN is only set if spawndaemon is called with
the -c down=exit_code option. See the
Configuration Files section for information
about the down_script and down_script_policy
fields in the process configuration file used to specify and control
the execution of the script that
keepalive calls when the process/daemon goes to the down
state. This process/daemon down feature is not supported for group
registrations.
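From the side of the monitored process/daemon, using the down feature means reading KEEPALIVE_PROCESS_DOWN and exiting with that value. The sketch below shows one way to do this; the function names are hypothetical, and it assumes that the variable holds the exit code as a decimal string.

```python
import os
import sys


def down_exit_code(environ=os.environ):
    """Return the registered down exit code, or None if the feature is off."""
    # KEEPALIVE_PROCESS_DOWN is only set when the process/daemon was
    # registered with -c down=exit_code.
    value = environ.get("KEEPALIVE_PROCESS_DOWN")
    if value is None:
        return None
    return int(value)  # assumption: the value is a decimal integer string


def go_down(environ=os.environ):
    """Exit with the down exit code so that keepalive does not restart us."""
    code = down_exit_code(environ)
    if code is None:
        raise RuntimeError("down feature not enabled for this registration")
    sys.exit(code)
```

A process/daemon taken down this way stays in the down state until spawndaemon is run with the -x option.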
- -c reject=exit_code
-
This option tells keepalive to enable the node-rejection
feature for the process/daemon being registered. The
node-rejection feature allows a process/daemon having a resource
problem (for example, insufficient memory) to reject the node on which
it is running. In rejecting a node, the process/daemon sends
exit_code to keepalive, which causes keepalive to
fail the process/daemon over to another node. keepalive chooses
the new node based on the node selection policy specified by the -Z
option.
If a process/daemon rejects all of the nodes in the cluster,
keepalive clears the list of rejected nodes except for the most
recently rejected node. However, if that node is the only available
node, keepalive clears it also. With the rejected node list
cleared, keepalive begins anew trying to move the process/daemon
to another node. The error counter for the process/daemon
is not reset. Each node rejection counts as a failure; therefore,
keepalive eventually takes the process/daemon to the down
state (when max_errors failures occur within
probation_period seconds as specified in the process
configuration file).
When a process/daemon rejects a node, it returns the exit code that
you specify in exit_code. exit_code must be an
integer other than zero (0). If exit_code is 0 (or
if the -c down option has already been specified with
the same exit_code),
registration fails and the process/daemon is not started. keepalive
communicates the node rejection exit code to the process/daemon through
the KEEPALIVE_NODE_REJECT environment variable, the value of which
is the exit_code you specify on the call to spawndaemon.
KEEPALIVE_NODE_REJECT is only set if spawndaemon is called
with the -c reject=exit_code option.
If a node failure recovery script has been specified in the
process configuration file, keepalive runs that script
to recover from a node rejection. If no node failure recovery script
has been specified, keepalive runs the process failure recovery
script (if specified) or the startup script. See the Configuration
Files section for information about specifying the script that
keepalive calls when the process/daemon rejects a node.
The node-rejection feature is not supported for group registrations.
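Because registration fails when either exit code is 0 or when the down and reject codes are equal, it can be convenient to check both values before building the command line. The helper below is a hypothetical sketch of such a check; it enforces only the two rules stated above and assumes nothing else about the valid range of exit codes.

```python
def exit_code_options(down=None, reject=None):
    """Build the -c part of a spawndaemon command line after validation."""
    for name, code in (("down", down), ("reject", reject)):
        if code is None:
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(code, bool) or not isinstance(code, int) or code == 0:
            raise ValueError(f"{name} exit code must be a nonzero integer")
    if down is not None and down == reject:
        raise ValueError("down and reject exit codes must differ")
    argv = []
    if down is not None:
        argv += ["-c", f"down={down}"]
    if reject is not None:
        argv += ["-c", f"reject={reject}"]
    return argv
```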
- -d full_pathname [arg_list]
-
Unregisters the named process/daemon. When specifying
full_pathname,
use a full path name as designated by the full_path_to_executable
field in the process configuration file. When -d is used with
the -a option, arg_list must be specified.
See the -a option for details.
- -D full_pathname
-
Identifies a process/daemon by name to the -x option. When specifying
full_pathname, use a full path name as designated by the
full_path_to_executable field in the process configuration file.
- -F node
-
Specifies the number of the (favored) node on which the process/daemon is to be
executed. keepalive pins the process/daemon on the specified node so
that the process/daemon cannot be migrated by using the
load_leveld(1)
utility; the pinned process/daemon ignores any migration request.
cluster(1M)
can be used to get information on the nodes in your cluster.
The nodes designated with the -F and -B options form the
set of nodes on which the process/daemon is allowed to run. The restart
policy option (-Z) determines how nodes are selected from this set
when the process/daemon needs to be restarted or fails to (re)start
on a selected node.
- -g group_name
-
Unregisters the processes/daemons in the group specified by
group_name, where group_name is defined in the
group configuration file for the group.
- -i cluster_wide | -i node_list
-
Enables idempotency enforcement on a cluster-wide or node-list basis.
The -F option is required and -B is optional when
-i node_list is used. Idempotency enforcement is performed at
registration time and persists after registration is completed.
Idempotency for the registered process/daemon instance continues
to be enforced whether or not the -i option is used
on subsequent attempts to register additional instances of the
same process/daemon. A special spawndaemon exit code of 2
(see Return Values) indicates an
idempotency violation.
When cluster_wide is specified, keepalive's monitored
process table is searched. In the case of registering a single process/daemon,
if the process/daemon is already
registered, spawndaemon counts this as an idempotency violation and
does not register the new instance. In the case of registering a group,
if any member of the group is already registered for that group,
then spawndaemon
counts this as an idempotency violation and does not register the new
group instance. Otherwise, in both cases, registration proceeds as normal.
When node_list is specified, keepalive's monitored
process table is searched. In the case of registering a single
process/daemon, the node list for the new
registered instance must be mutually exclusive with the existing registered
instances of the same process/daemon. If the node lists are not mutually
exclusive, spawndaemon counts the registration
attempt as an idempotency violation and does not register the new instance.
In the case of registering a group, the monitored process table is searched
to verify that the new group instance has a node list that is mutually
exclusive with any other group instances of the same name already registered
with keepalive.
If a group instance of the same name is already registered with a
node list that is not mutually exclusive with the new group instance,
spawndaemon counts this as an idempotency violation and does
not register the new group instance.
Group registrations, both cluster wide and node list, are only affected
by other registered instances of the same group.
When spawndaemon attempts to register a group of processes and
discovers that one or more processes within the group have been
previously registered as non-grouped processes, it still registers all
the processes specified in the group. If you want two groups containing
the same set of processes to run at the same time, be sure to specify
different names for the two groups.
Group registrations can prevent single process registrations. If a group
is registered and the -i cluster_wide option is used to attempt
to register a single process/daemon that is a
member of the group, the registration attempt is considered an idempotency
violation. If a group is
registered and the -i node_list option is used to attempt to
register a single process/daemon with the same node list as another
instance of the process/daemon that belongs to the group, the registration
attempt is considered an idempotency violation.
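The node_list rule for a single process/daemon reduces to a set test: a new instance is acceptable only if its node list shares no node with any registered instance of the same process/daemon. The following is a sketch of that test with a hypothetical function name; it does not model the group rules or the monitored process table itself.

```python
def violates_node_list_idempotency(new_nodes, registered_node_lists):
    """Return True if new_nodes overlaps any already registered node list.

    new_nodes             -- nodes given with -F and -B for the new instance
    registered_node_lists -- one node list per registered instance of the
                             same process/daemon
    """
    new = set(new_nodes)
    # Node lists are mutually exclusive when their intersection is empty.
    return any(new & set(existing) for existing in registered_node_lists)
```

When a check of this kind fails inside spawndaemon, the registration is refused and the command exits with the special exit code 2.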
- -k
-
Invokes the shutdown script (named in the process configuration
file) for the process/daemon identified by another option in the command line.
Options for identifying the process/daemon include -d, -p,
-s, -D, -P, and -S.
If no shutdown script exists for the process/daemon, a SIGTERM signal is
sent to the process/daemon.
The termwait option specified in the process configuration
file can be used to define a time
interval (in seconds) that keepalive gives the process/daemon to
shut down before sending it a
SIGKILL signal. The time interval defaults to two seconds.
keepalive attempts to run the shutdown script on the node where the
process/daemon was last active. If that node is down, keepalive tries
to fork/exec the shutdown script on the other nodes in the cluster, one
at a time, until an attempt succeeds or the list of available nodes is
exhausted. As each attempt is made, warning messages are posted to the system log.
If the shutdown script cannot be run on any node, an error message is posted
to the system log and system console.
When -k is used with -g to shut down a group, shutdown scripts
are used for those group members that have them; SIGTERM is used
for those group members that do not have a shutdown script.
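The SIGTERM, termwait, SIGKILL sequence is a common escalation pattern, and the sketch below shows the same pattern applied to a child process started from Python. It is an illustration only, not keepalive's implementation: the function name is hypothetical, it covers only the case where no shutdown script exists, and the default of 2.0 mirrors the documented two-second default.

```python
import subprocess


def stop_with_termwait(proc, termwait=2.0):
    """Ask a child process to stop, then force it after termwait seconds.

    proc     -- a subprocess.Popen object
    termwait -- seconds to wait between SIGTERM and SIGKILL
    Returns the exit status reported by Popen.wait().
    """
    proc.terminate()  # sends SIGTERM on POSIX systems
    try:
        return proc.wait(timeout=termwait)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX systems
        return proc.wait()
```

On POSIX systems a negative return value is the negated number of the signal that ended the process, which shows whether the process stopped on SIGTERM or had to be killed.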
- -L
-
Displays the state of the monitored process table in an abbreviated form.
If the keepalive daemon has been quiesced (see -q option),
a message to that effect follows the table. The -v (verbose) options
display/read the full details of the monitored process table.
- -n times
-
Used with the -r option to specify the number of times
keepalive registers and starts a process/daemon. If the -i
option is also specified, the processes/daemons are started only if no copies
of the process/daemon are already running (so you can make sure that all old
copies of a process/daemon are gone before starting new ones). The -n
option cannot be used with group registrations.
- -o
-
A process registered with the -o option
must not daemonize itself. This option instructs keepalive
to not perform daemonization recovery, thereby optimizing keepalive's
performance for processes of this type. If -o is used for group
registration, none of the group members can be daemons.
- -p pid
-
Specifies the process ID of a process/daemon to be unregistered.
- -P pid
-
Identifies a process/daemon by process ID to the -x option.
- -q
-
Quiesces keepalive; that is, prevents keepalive from
monitoring processes or daemons. keepalive continues to maintain
its internal status. spawndaemon is still functional while
keepalive is quiesced; however, monitored processes/daemons are not
restarted if they die with keepalive in this quiesced state.
Use the -X option to resume normal keepalive operation.
- -Q
-
Shuts down the keepalive daemon, cleans up, and exits. As part
of the cleanup procedure, -Q clears keepalive's monitored
process table, which means that all process/daemon registrations are
lost. This action does not affect the processes or daemons themselves.
They continue to run and are unaffected. Use the -k option
with -Q if you also want to shut down all of the monitored
processes/daemons.
To shut down keepalive, but leave the monitored process table intact,
send keepalive a SIGTERM signal so it performs a controlled exit.
For keepalive to remain down, /etc/inittab must be edited
to remove the keepalive entries.
- -r process_cfg_file
-
Registers the process/daemon named in the process configuration file
specified in process_cfg_file. After the process/daemon is
registered with keepalive, keepalive starts the process or
daemon by calling its startup script. The process configuration file,
which contains the process/daemon's characteristics, must be stored
in the /etc/spawndaemon.d directory and its name must begin with
the prefix ka_. The -r and -R options are
mutually exclusive.
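The two location rules for a process configuration file (stored in /etc/spawndaemon.d, name beginning with ka_) are easy to check before registration. The helper below is a hypothetical sketch that works on a path string only and does not touch the file system; ownership and permission requirements are not checked.

```python
import posixpath

CONFIG_DIR = "/etc/spawndaemon.d"
CONFIG_PREFIX = "ka_"


def check_process_cfg_path(path):
    """Return a list of problems found in a configuration file path."""
    problems = []
    directory, name = posixpath.split(posixpath.normpath(path))
    if directory != CONFIG_DIR:
        problems.append(f"not stored in {CONFIG_DIR}")
    if not name.startswith(CONFIG_PREFIX):
        problems.append(f"name does not begin with {CONFIG_PREFIX!r}")
    return problems
```

An empty list means that both rules are satisfied.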
- -r group_cfg_file
-
Registers all the processes/daemons named in the group configuration file
specified by group_cfg_file. As soon as each process/daemon
is registered, keepalive starts the process/daemon
by calling that process/daemon's startup script.
When spawndaemon attempts to register a group of processes/daemons and
discovers that one or more processes/daemons within the group have been
previously registered as non-grouped processes/daemons, it still registers all
the processes/daemons specified in the group. If spawndaemon attempts to
register a group of processes/daemons and discovers that an identically-named
group of processes/daemons has already been registered, it does not register
the new group, but exits with an error (return value of 1).
To register two or more groups containing the same set of
processes/daemons, specify different group names for the groups.
You must create the group configuration file and a
process configuration file for each group member in the
/etc/spawndaemon.d directory. For information about the required
format of the configuration files,
see Configuration Files.
- -R process_cfg_file pid
-
Registers an already-running process/daemon with keepalive. You must
provide the name of the process/daemon to be registered in the
process configuration file specified by process_cfg_file. The
configuration file must be stored in the /etc/spawndaemon.d
directory. For information about configuration file naming and syntax
requirements, and required access control permissions, see
Configuration Files.
You must specify the process ID of the running process/daemon in
pid. If
the process/daemon with that PID is already registered,
spawndaemon logs an error and no new registration takes place.
- -s slot
-
Specifies the slot number (in the monitored process table)
of a process/daemon to be unregistered. Process/daemon slot numbers
can be read with the
-v human processes and -v machine processes options.
- -S slot
-
Identifies a process/daemon by monitored process table slot number
to the -x option. Process/daemon slot numbers can be read with the
-v human processes and -v machine processes options.
- -U
-
Directs keepalive to spawn the process/daemon without pinning it to a
particular node. Unpinned processes/daemons can be migrated from one node to
another whenever they are sent a migration request, such as when
the load_leveld(1) utility
is active. If -U is not used, the process/daemon being registered
is pinned to a node when spawned (see -Z option).
The default migration handler performs the migration; however,
the default handler does not migrate a process/daemon's kernel objects (file
descriptors and so on) to the new node. Those objects remain on the
original node. If the original node fails, the migrated process/daemon can no
longer access its kernel objects, which can lead to unpredictable
behavior from the process/daemon. To preserve your process/daemon's kernel objects
during migration, implement your own migration handler, and
have that handler destroy and rebuild kernel objects (such as file
descriptors) during migration.
To run multiple instances of the same process/daemon, you must be
consistent in how you use the -U
option. You cannot mix pinned and unpinned instances with the same name;
keepalive requires such consistency in order to perform daemonization
recovery on an instance that has failed. When the -o option is used
to register an instance, this consistency restriction does not apply
(processes registered with -o do not daemonize).
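The consistency rule for -U can be expressed as a check against the instances that are already registered. The sketch below is one reading of the rule, with hypothetical names: it assumes that instances registered with -o are exempt both as the new instance and as existing instances, which the text above does not state explicitly.

```python
def can_register_instance(name, unpinned, no_daemonize, registered):
    """Return True if a new instance keeps pinned/unpinned use consistent.

    name         -- full path name of the process/daemon
    unpinned     -- True if -U is used for the new instance
    no_daemonize -- True if -o is used for the new instance
    registered   -- iterable of (name, unpinned, no_daemonize) tuples
    """
    if no_daemonize:
        # The restriction does not apply to -o registrations.
        return True
    for other_name, other_unpinned, other_no_daemonize in registered:
        if other_name != name or other_no_daemonize:
            continue  # different process/daemon, or exempt (assumption)
        if other_unpinned != unpinned:
            return False  # would mix pinned and unpinned instances
    return True
```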
- -v human [processes | keepalive]
-
When used with the -L option, displays the full contents of
keepalive's monitored process table in a
human-readable format. The following fields are displayed for each
process/daemon registered with keepalive when the processes
option is specified:
| Monitored Process Table Process/Daemon Fields |
| Field | Description |
| state | The keepalive state for the process/daemon. See Process/Daemon States for a description of each state. |
| pid | The current process identification number (PID) of the process/daemon. |
| node number | The number of the node on which the process/daemon is running. |
| full path to process | The complete path to the process/daemon. |
| argument list | The arguments (if any) used by keepalive to distinguish this process/daemon from others by the same name. |
| child of keepalive | Set TRUE if the process is a child of keepalive; otherwise (such as when daemonization recovery is underway), set FALSE. |
| daemonization recovery | Set TRUE if the process has daemonized itself; otherwise, set FALSE. |
| pinned | Indicates whether the process/daemon has been designated to run on one or more specific nodes in the cluster. If set TRUE, the process/daemon is pinned to one or more nodes. If set FALSE, the process/daemon can float (migrate) among the nodes in the cluster. |
| lastexeced | Indicates when the process/daemon was last started. |
| process first died | Indicates the first time the process/daemon stopped before being restarted. |
| process last died | Indicates the last time the process/daemon stopped before being restarted. |
| min. respawn | Specifies the number of seconds the process/daemon must run before it is eligible for restarting. |
| num. errors | The number of errors (such as process/daemon failures) that have occurred during the current probation period, which starts when the process/daemon fails and the error count is set to one (1). The error count includes process/daemon failures, node rejections by the process/daemon (see -c reject option), and node failures. |
| total errors | Specifies the total number of errors since the process/daemon was first started. See num. errors for the events included in the error count. |
| max. errors during probation | Specifies the maximum number of errors (process/daemon failures) allowed before keepalive will no longer respawn the process/daemon (leaving the process/daemon in the down state). The maximum number of errors must occur during the specified probation period in order for the process/daemon to be left in the down state. |
| probation period | Specifies the time, in seconds, during which the number of errors specified by max_errors_during_probation must occur in order for the process/daemon to be taken to the down state. |
| registration policy | One of the following methods by which the process/daemon is registered: Name, meaning keepalive looks for the process/daemon by name (the -a and -o options were not used to register the process/daemon); Argument List, meaning keepalive looks for the process/daemon by name and argument list (the -a option was used); PID, meaning keepalive looks for the process/daemon by process ID (-o option was used). |
| node selection policy | Identifies the node selection policy as specified by the -Z option. |
| favored node | The node on which the process/daemon is executed, as specified by the -F option. If no node is specified, this value is None. |
| backup nodes | Specifies the nodes on which the process/daemon is executed if the favored node is unavailable. If no nodes are specified, this value is None. |
| rejected nodes | A list of nodes that the process/daemon has rejected or the keepalive node selection policy has rejected. |
| termwait | The time interval (in seconds) that keepalive gives the process/daemon to shut down before sending it a SIGKILL signal. |
| euid | The user identification number of the process/daemon. |
| egid | The group identification number of the process/daemon. |
| startup script | The name of the script that keepalive executes when it starts the process/daemon. |
| shutdown script | The name of the script that keepalive executes when it shuts down the process/daemon. |
| process failure recovery script | The name of the script that keepalive executes when it restarts the process/daemon after it fails. |
| node failure recovery script | The name of the script that keepalive executes when it restarts a process/daemon whose node has failed. |
| down script | The name of the script that keepalive executes when a process/daemon enters the down state. |
| group | The name of the registration group to which the process/daemon belongs. If the process/daemon does not belong to a group, None is displayed. |
| critical group process | Set to TRUE or FALSE to indicate whether or not the process/daemon is critical to its group. Set to N/A if the process/daemon is not a member of a group. |
| reject node exit code | Exit code specified by -c reject option. Set to None if the process/daemon was not registered with the -c reject option. |
| down exit code | Exit code specified by -c down option. Set to None if the process/daemon was not registered with the -c down option. |
| exit status returned | If a process/daemon is not running, this is the exit code associated with the most recent failure/exit. If a process/daemon is running, None is reported. |
| last pid | The PID of the last failed process/daemon in this slot. |
| slot | The slot number of the process/daemon in this table. |
The following fields are displayed when the keepalive
option is specified:
| Monitored Process Table Keepalive Fields |
| Field | Description |
| running | Set TRUE if keepalive is running; otherwise, set FALSE. |
| quiesce flag | Set TRUE if keepalive is currently quiesced with the -q option; otherwise, set FALSE. |
| pid | The process identification number for keepalive. |
| node number | The number of the node on which keepalive is running. |
| registered processes | The total number of processes/daemons currently registered with keepalive, which equates to the number of entries in the monitored process table. |
| table size | The current number of slots in the monitored process table. Each registered process/daemon uses one slot in the table. "table size" is less than or equal to "max. possible processes"; if less than "max. possible processes," it is because keepalive has not allocated memory for the maximum number of slots. |
| max. possible processes | The maximum number of processes/daemons that can be registered, which equates to the maximum size of the monitored process table. This is a minimum of 200 and can be increased with the -z option. |
| polling | Set TRUE or FALSE, indicating whether keepalive detects a process/daemon failure by polling rather than via child process adoption (that is, on receipt of the SIGCHLD signal). |
| polling interval | The time in seconds that keepalive uses as a polling interval. This can be controlled with the -t option of keepalive(1M). |
| primary node | The primary node on which the keepalive process is executed. This can be controlled with the -P option of keepalive(1M). |
| secondary node | The secondary node on which the keepalive process is executed when the primary node is unavailable. This can be controlled with the -P option of keepalive(1M). |
- -v machine [processes | keepalive]
-
Displays the full contents of keepalive's monitored process table in a
format that can be parsed by shell utilities or by a C program. The
record for each process/daemon is listed on a single line and is
terminated with a newline character. Each field in the record appears in the
format FIELD="VALUE" or
FIELD=VALUE. Each field/value pair is terminated by
a semicolon (;). VALUE is enclosed in double quotes when
keepalive provides a string value, so a semicolon that appears inside
a string value is not mistaken for the field delimiter.
In fields that contain lists of values, a comma separates each
VALUE or "VALUE". The following field/value pairs are
returned for each process/daemon when the processes option is specified:
| Field No. | Field="Value" or Field=Value |
| 1 | state="<current state of the process/daemon>"; |
| 2 | pid="<pid of process/daemon>" or "None"; |
| 3 | node_number="<number of node running process/daemon>" or "None"; |
| 4 | full_path_to_process="<full pathname for process/daemon>"; |
| 5 | arg_list="<list of arguments>" or "" (null string); |
| 6 | child_of_keepalive=TRUE or FALSE; |
| 7 | daemonization_recovery=TRUE or FALSE; |
| 8 | pinned=TRUE or FALSE; |
| 9 | lastexeced="<time process/daemon was last exec'ed>"; |
| 10 | process_first_died="<time process/daemon first died>" or "Never"; |
| 11 | process_last_died="<time process/daemon last died>" or "Never"; |
| 12 | minrespawn=<minimum time allowed between respawns>; |
| 13 | num_errors=<number of process/daemon failures within current probation_period>; |
| 14 | total_errors=<total number of failures for this process/daemon>; |
| 15 | max_errors_during_probation=<maximum number of failures allowed within probation_period>; |
| 16 | probation_period=<probation period in seconds>; |
| 17 | registration_policy="<method by which process/daemon is registered>"; |
| 18 | node_selection_policy="<node selection policy>"; |
| 19 | favored_node="<node designated with -F option>" or "None"; |
| 20 | backup_nodes="<node designated with -B option>, <node designated with -B option>, ..." or "None"; |
| 21 | rejected_nodes="<rejected node>, <rejected node>, ..." or "None"; |
| 22 | termwait=<time allowed (in seconds) to shut down before sending SIGKILL>; |
| 23 | euid=<user ID of process/daemon>; |
| 24 | egid=<group ID of process/daemon>; |
| 25 | startup_script="<pathname of startup script>"; |
| 26 | shutdown_script="<pathname of shutdown script>" or "None"; |
| 27 | process_failure_recovery_script="<pathname of process failure recovery script>" or "None"; |
| 28 | node_failure_recovery_script="<pathname of node failure recovery script>" or "None"; |
| 29 | down_script="<pathname of down script>" or "None"; |
| 30 | group="<group name>" or "None"; |
| 31 | critical_group_process="TRUE" or "FALSE" (if a group member) or "N/A"; |
| 32 | reject_exit_code="<exit code used to reject host node>" or "None"; |
| 33 | down_exit_code="<exit code used to take process/daemon down>" or "None"; |
| 34 | exit_status_returned="<exit code if process/daemon not running>" or "None"; |
| 35 | last_pid="<pid of last failed process/daemon>" or "None"; |
| 36 | slot=<monitored process table slot number of this process/daemon>; |
Refer to the -v human option for a description of the fields for
each process/daemon.
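The field/value pairs above lend themselves to scripted parsing. The following is a minimal sketch, not part of spawndaemon; it assumes that one record arrives as a single string of semicolon-terminated Field=Value or Field="Value" pairs and that quoted values contain no embedded double quotes. The names parse_record and split_list, and the sample record, are hypothetical.

```python
import re

# One pair: name=value; where value is either "quoted" or bare (no semicolon).
PAIR = re.compile(r'(\w+)=(?:"([^"]*)"|([^;]*));')


def parse_record(text):
    """Return a dict mapping each field name to its value as a string."""
    record = {}
    for name, quoted, bare in PAIR.findall(text):
        # Exactly one of the two value groups is non-empty (or both are
        # empty, for a null string such as arg_list="").
        record[name] = quoted or bare.strip()
    return record


def split_list(value):
    """Split a comma-separated list field; "None" or "" yields an empty list."""
    if value in ("", "None"):
        return []
    return [item.strip() for item in value.split(",")]


# Hypothetical sample record, invented for illustration only.
sample = ('state="ok";pid="1234";pinned=TRUE;'
          'backup_nodes="2, 3";rejected_nodes="None";slot=5;')
record = parse_record(sample)
backups = split_list(record["backup_nodes"])
```

All values are kept as strings; callers can convert numeric fields such as slot or probation_period as needed.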
The following field/value pairs are returned when the keepalive
option is specified:
| Field No. | Field="Value" or Field=Value |
| 1 | running=TRUE or FALSE; |
| 2 | quiesce_flag=TRUE if keepalive quiesced, FALSE otherwise; |
| 3 | pid="<pid of keepalive>" or "None"; |
| 4 | node_number="<number of node running keepalive>" or "None"; |
| 5 | registered_processes=<current number of registered processes>; |
| 6 | table_size=<current size of monitored process table>; |
| 7 | max_possible_processes=<maximum number of processes/daemons that can be registered (maximum size of monitored process table)>; |
| 8 | polling=TRUE if keepalive detects process/daemon failure via polling, FALSE if SIGCHLD is used to detect failure; |
| 9 | polling_interval=<keepalive polling interval>; |
| 10 | primary_node="<primary node on which keepalive is executed>" or "None"; |
| 11 | secondary_node="<secondary node on which keepalive is executed>" or "None"; |
Refer to the -v human option for a description of the fields for
keepalive.
- -x
-
If a process/daemon is down for any reason (for example, because it has
reached its maximum allowed error count within the specified probation period),
its error (failure) count is cleared and the process/daemon is restarted.
For information about max_errors and
probation_period, see the
Configuration Files
section of this reference manual page.
If a process/daemon is running, the -x option clears the error count
so it appears the process/daemon has not failed. If -k is used
with -x, the process/daemon is shut down and restarted, which can
be used to restart a process/daemon that appears hung.
- -X
-
Resumes normal keepalive operation. Call spawndaemon with
this option after you quiesce keepalive with the -q option.
- -z max_processes
-
Increases the size of the monitored process table (the maximum number of
processes/daemons that keepalive can monitor) to the value
specified in max_processes. By default, keepalive
can monitor 200 processes/daemons; this option can only be used to
increase the maximum size of the table.
- -Z F_node | round_robin | last_node
-
Specifies a node selection policy for the process/daemon.
A node selection policy only has
meaning if the spawned process/daemon is to be pinned to a node; therefore,
do not use this option if the spawned process/daemon is not to be pinned
(see -U option). If the -Z and -U options are
not used to register individual processes/daemons, the node selection policy
defaults to round_robin. For group registrations, the node selection policy
defaults to last_node.
If you specify F_node, the node specified with the -F option
is used first for the restart attempt. If this favored node is not available,
the node(s) designated with the -B option are tried next. If none
of the designated nodes are available, the restart attempt is delayed
until one of the nodes becomes available.
Use of F_node is not valid unless the -F option is also used.
If you specify round_robin without using the -F and -B
options, keepalive picks the first
available node in the cluster on which to restart the process/daemon,
unless the last restart attempt on that node failed, in which
case keepalive picks the next available node. keepalive maintains
a list of visited nodes; when all nodes
in the list have been visited, the list is cleared and reused until the
process/daemon reaches max_errors within the
probation_period. Specify round_robin if you do not
care which node runs your process/daemon. Note, however, that the -F and
-B options can be used with round robin to designate the
nodes used in the restart attempts. The round_robin node selection policy
offers the highest level of availability since the node on
which the process/daemon most recently failed/exited is avoided.
If you specify last_node, keepalive first tries to restart
the process on the last node on which it was running before trying to
restart it on any other node. If the restart attempt fails, the nodes
designated with the -F and -B options are used on subsequent
restart attempts in the order specified on the spawndaemon command line.
If the -F option is used without the -B option, the restart
attempts are repeated on the -F node as long as it is available. If
no -F (or -B) option is used, the restart is attempted on
an available node in a round robin fashion. The last_node restart
policy is useful in those situations where resources (such as a shared
memory segment) may persist across process/daemon failures.
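The three node selection policies can be modeled roughly as follows. This is a simplified sketch of the ordering rules described above, not keepalive's actual algorithm; the function names, the use of integers as node identifiers, and the choice to sort undesignated nodes numerically are all assumptions made for illustration.

```python
def candidate_order(policy, available, favored=None, backups=(), last_node=None):
    """Return the available nodes in the order a restart would try them.

    An empty result models the case where the restart attempt is delayed
    until a suitable node becomes available.
    """
    designated = ([favored] if favored is not None else []) + list(backups)
    if policy == "F_node":
        if favored is None:
            raise ValueError("F_node is not valid without a favored node (-F)")
        order = designated
    elif policy == "last_node":
        # Last node first, then the -F/-B nodes, else any available node.
        rest = designated if designated else sorted(available)
        order = ([last_node] if last_node is not None else []) + rest
    elif policy == "round_robin":
        order = designated if designated else sorted(available)
    else:
        raise ValueError("unknown policy: %r" % (policy,))
    result = []
    for node in order:
        if node in available and node not in result:
            result.append(node)
    return result


def next_round_robin(order, visited):
    """Pick the first node not yet visited; clear the list once all are visited."""
    if not order:
        return None
    if all(node in visited for node in order):
        visited.clear()
    for node in order:
        if node not in visited:
            visited.append(node)
            return node
```

For example, with nodes 1 and 2 and an initially empty visited list, three successive calls to next_round_robin select node 1, node 2, and then node 1 again after the list is cleared.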
Return Values
The spawndaemon command returns the following values:
- 0 - The operation has completed successfully.
- 1 - A fatal error has occurred.
- 2 - An idempotency violation has occurred.
Files
- /dev/keepalivecfg
-
Named pipe that keepalive uses for receiving commands from
the spawndaemon utility.
- /etc/keepalive.d
-
Directory where you store the scripts for managing monitored processes/daemons.
For each monitored process/daemon, you must
provide a startup script.
Optionally, you can provide additional scripts for keepalive to
call when other events occur, such as process/daemon failure, node
failure, or when the process/daemon goes to the keepalive down
state (meaning, for example, that the monitored process/daemon has produced
more errors than allowed within its specified probation period and,
therefore, is not restarted by keepalive). For more
information about optional scripts, see the
Configuration Files
section later in this reference manual page.
The user and group ID must be root for all scripts.
In addition, access control permission
for each script must be set to rwx r-x r-x (755).
- /etc/keepalive.d/keepalive.data
-
keepalive stores the state of all monitored processes and
daemons in this memory-mapped data file referred to as the monitored
process table. If you remove
/etc/keepalive.d/keepalive.data, the keepalive daemon
should be shut down with the spawndaemon -Q option.
When no longer able to communicate with
keepalive, spawndaemon displays a message indicating that
it cannot find keepalive.
- /etc/rc*.d
-
Set of directories where you register processes/daemons with keepalive
and store the start and stop scripts for system processes/daemons.
- /etc/spawndaemon.d
-
Directory where you store configuration files for each monitored
process/daemon. The name of all process and group configuration files
must begin with the prefix ka_. The following section
describes configuration files in detail.
Configuration Files
All configuration files must reside in the /etc/spawndaemon.d
directory and their names must begin with the prefix ka_. The user
and group ID must
be root for all configuration files. In addition,
access control permission for each configuration file must be set to
rw- r-- r-- (644).
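The naming, ownership, and permission rules above can be checked before registration. The sketch below is a hypothetical helper, not part of the cluster tools; it assumes root corresponds to user and group ID 0, and it takes the expected owner and mode as parameters so that the same check can be applied to the scripts in /etc/keepalive.d with a mode of 0o755 and an empty prefix.

```python
import os
import stat


def check_config_file(path, owner_uid=0, owner_gid=0, mode=0o644, prefix="ka_"):
    """Return a list of problems found with the file; empty if it conforms."""
    problems = []
    if not os.path.basename(path).startswith(prefix):
        problems.append("name does not begin with %r" % prefix)
    info = os.stat(path)
    if info.st_uid != owner_uid:
        problems.append("user ID is %d, expected %d" % (info.st_uid, owner_uid))
    if info.st_gid != owner_gid:
        problems.append("group ID is %d, expected %d" % (info.st_gid, owner_gid))
    actual = stat.S_IMODE(info.st_mode)  # permission bits only
    if actual != mode:
        problems.append("mode is %o, expected %o" % (actual, mode))
    return problems
```

A call such as check_config_file("/etc/spawndaemon.d/ka_cron") would return an empty list for a conforming file and a list of human-readable problems otherwise.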
The spawndaemon command uses different configuration
file formats depending on whether an individual process/daemon or a group of
processes/daemons is to be registered. However, all members of a group must
have an individual process configuration file as well as being identified
as a group member in a group configuration file.
Process Configuration Files
Each process configuration file for registering a single process/daemon
must be formatted as follows:
[group_name]:full_path_to_executable:[arg_list]:[termwait]:uid:gid:[max_errors]:
[probation_period]:[minrespawn]:startup_script:[shutdown_script]:
[process_failure_recovery_script]:[node_failure_recovery_script]:[down_script]:[down_script_policy]
Each field must be separated by a colon; all fields must be listed on the
same line.
group_name specifies the name of the keepalive
group. This field is required for a process/daemon that is a member
of a group and must be left blank if the process/daemon is
not a member of a group. The group_name specified here
must be identical to the group's group_name field in the group
configuration file. The value you specify for group_name
cannot exceed 16 characters in length.
full_path_to_executable specifies the full pathname of the
process/daemon to be monitored.
arg_list is supplied when the -a option is used for
registration. arg_list is all (or a subset of) the arguments
used to distinguish this instance of the process/daemon from others of the same
name. arg_list specifies all (or some) of the arguments used
on the process/daemon command line(s) in the startup and recovery (restart)
scripts, in the same order (with none missing in the sequence provided)
as they appear on the command line(s), starting at the beginning of the
argument list.
termwait is the time interval (in seconds) that keepalive
gives the process/daemon to shut down
before sending it a SIGKILL signal. The time interval defaults to
two (2) seconds.
uid and gid specify the name of the user ID
and group ID, respectively, which keepalive uses to spawn the registered
process/daemon.
keepalive calls
setuid(2)
with uid and
setgid(2)
with gid, respectively, when it forks the monitored
process/daemon.
max_errors specifies the maximum number of errors allowed for a
process/daemon before the keepalive daemon stops respawning
it (thereby leaving the process/daemon in a down state). The number of
errors must occur within the number of seconds specified by the
probation_period field. Specifying zero (0) for either
max_errors or probation_period causes
keepalive to
restart the process/daemon an infinite number of times. The default
value for max_errors is 10 errors. The default value for
probation_period is 300 seconds. A registered
process/daemon's first error triggers its probation-period timer. If
fewer than max_errors errors occur within the probation period, the
period expires; on the next failure, keepalive resets the
process/daemon's error count to one (1) and restarts the
probation-period timer.
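The interaction between max_errors and probation_period can be sketched as a small counter. This is an illustrative model only; it assumes that the process/daemon goes down when its error count reaches max_errors (the exact boundary is not stated here), and the class and method names are hypothetical.

```python
class ErrorTracker:
    """Model of the per-process error count and probation-period timer."""

    def __init__(self, max_errors=10, probation_period=300):
        self.max_errors = max_errors
        self.probation_period = probation_period
        self.num_errors = 0       # failures within the current probation period
        self.total_errors = 0     # failures since registration
        self.period_start = None  # time of the failure that started the timer

    def record_failure(self, now):
        """Return "respawn" or "down" for a failure at time `now` (seconds)."""
        self.total_errors += 1
        # Zero for either field means the process is restarted indefinitely.
        if self.max_errors == 0 or self.probation_period == 0:
            return "respawn"
        expired = (self.period_start is not None
                   and now - self.period_start >= self.probation_period)
        if self.period_start is None or expired:
            # First failure, or first failure after an expired period:
            # the count restarts at one and the timer restarts.
            self.period_start = now
            self.num_errors = 1
        else:
            self.num_errors += 1
        # Assumption: "down" as soon as the count reaches max_errors.
        return "down" if self.num_errors >= self.max_errors else "respawn"
```

With max_errors set to 3 and probation_period set to 60, failures at 0, 10, and 20 seconds end in the down state, whereas failures at 0, 30, and 100 seconds do not, because the third failure falls outside the period and restarts the count at one.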
minrespawn is the number of seconds that is considered the
minimum respawn time. The minrespawn timer starts when the process/daemon
is spawned. If the process/daemon exits in less than the
respawn time, keepalive does not restart the process/daemon
until the respawn
time elapses. The default for minrespawn is zero (0) seconds.
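The minrespawn rule amounts to a simple delay calculation, sketched below. The function name and the use of plain numbers of seconds for timestamps are assumptions for illustration.

```python
def respawn_delay(spawned_at, exited_at, minrespawn=0):
    """Seconds to wait after exit before the process/daemon may be restarted."""
    ran_for = exited_at - spawned_at
    # No delay once the process has run for at least minrespawn seconds.
    return max(0, minrespawn - ran_for)
```

A process/daemon with a minrespawn of 10 that exits 3 seconds after being spawned would wait a further 7 seconds under this model; one that ran for 30 seconds would be restarted without delay.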
startup_script is the name of the script that keepalive
runs to start the process/daemon. However, keepalive also
executes this script in the following situations: the process/daemon
fails and no process_failure_recovery_script has been
specified in the configuration file; a node failure occurs and neither a
process_failure_recovery_script nor a
node_failure_recovery_script has been
specified in the configuration file. startup_script
is the only required script.
The script must reside in the /etc/keepalive.d directory.
shutdown_script is the name of an optional script that
keepalive executes when the administrator terminates the
process/daemon by running spawndaemon with the -k option. The
shutdown script must perform all steps necessary for a controlled
shutdown. To support the shutdown script, keepalive sets
$KEEPALIVE_ACTIVE_PID to the PID of the process/daemon for which the
shutdown has been requested. If no shutdown script is specified,
keepalive issues a SIGTERM signal, which the process/daemon
must handle. If the process/daemon still exists after the
termwait interval, keepalive sends the process/daemon
a SIGKILL signal. The
shutdown script must reside in the /etc/keepalive.d directory.
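When no shutdown script is specified, the termination sequence described above (SIGTERM, a wait of termwait seconds, then SIGKILL) resembles the following sketch. It illustrates the general technique using the Python standard library on a child process; it is not keepalive's implementation, and the function name stop_process is hypothetical.

```python
import signal
import subprocess


def stop_process(proc, termwait=2.0):
    """Send SIGTERM, wait up to termwait seconds, then fall back to SIGKILL.

    `proc` is a subprocess.Popen object. Returns "terminated" if the process
    exited within termwait seconds of SIGTERM, or "killed" if SIGKILL was needed.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=termwait)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()   # sends SIGKILL on POSIX systems
        proc.wait()   # reap the process
        return "killed"
```

A process/daemon that handles SIGTERM and exits promptly never receives SIGKILL; the default termwait of two seconds mirrors the default given for the termwait field.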
process_failure_recovery_script is the name of an optional
script that keepalive executes if the process/daemon fails.
keepalive also runs this script if the host node for the
process/daemon fails, but no
node_failure_recovery_script has been specified in the
process configuration file. The process_failure_recovery_script
must reside in the /etc/keepalive.d directory.
node_failure_recovery_script is the name of an optional script
that keepalive executes in the event the host node for the
process/daemon fails. The node failure recovery script must reside
in the /etc/keepalive.d directory.
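The fallback rules among the startup and recovery scripts can be summarized in a few lines. The sketch below only restates the rules given in the field descriptions above; the function name and the event labels "process_failure" and "node_failure" are hypothetical.

```python
def recovery_script(event, startup_script,
                    process_failure_recovery_script=None,
                    node_failure_recovery_script=None):
    """Return the name of the script that would run for the given event."""
    if event == "process_failure":
        return process_failure_recovery_script or startup_script
    if event == "node_failure":
        # Node failure falls back to the process failure script, then startup.
        return (node_failure_recovery_script
                or process_failure_recovery_script
                or startup_script)
    raise ValueError("unknown event: %r" % (event,))
```

For a configuration that names only cron_startup and cron_restart (as the startup and process failure recovery scripts), both a process failure and a node failure would select cron_restart under these rules.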
down_script is the name of the script that keepalive runs
if the monitored process/daemon goes to the keepalive down
state. Processes/daemons enter the down state when they fail more times
than allowed within their specified probation period or they take
themselves down with a down exit code (see -c down option). A down script
should return the failed process/daemon's resources and perform any
other cleanup-related tasks. keepalive attempts to run the down
script on the same node on which the failed process/daemon was last
running. However, if that node is down, keepalive tries the
other nodes in turn until it exhausts the list of available nodes
(logging each failure in the system log). To support the down script,
keepalive sets
$KEEPALIVE_LAST_PID to keepalive's last known PID for the process/daemon
before it went to the down state. The down_script script
must reside in the /etc/keepalive.d directory.
down_script_policy specifies whether keepalive should run
the down script if the node on which the monitored process/daemon
is running fails (thereby causing the process/daemon to go to the
keepalive down state). A value of zero (0) for
down_script_policy
means that keepalive does not run the down script. A value of
one (1), the default, indicates that keepalive runs the down
script.
See Examples for an example of a
process configuration file.
Group Configuration Files
A group configuration file must be formatted as follows:
<keepalive_group>:group_name
member_file:[wait_time]:[critical]
member_file:[wait_time]:[critical]
...
where <keepalive_group> and member_file each
start a new line.
The first item in the first line of the group configuration
file must be the string <keepalive_group> (the < and
> symbols are required).
The group_name, which is limited to no more than 16
characters in length, specifies the name you want to assign the
process/daemon group. The string you specify for group_name
must match the group_name string found in the
process configuration files for each of the group members.
Each line following the first line applies to a process/daemon that belongs
to the group.
member_file specifies the name of a process
configuration file for the group member.
If you specify a group configuration file as a member_file,
spawndaemon aborts the registration of the group.
If specified, wait_time is the number of seconds that
keepalive delays before starting the next process/daemon in
the group. When specified, wait_time must be a non-negative
integer. If you do not specify wait_time, spawndaemon
accepts the default value of 0 (zero seconds), meaning that there is
to be no delay. Specifying a delay for the last process/daemon in
the list has no effect.
critical is the criticality of the process/daemon within the
keepalive group. Valid criticalities are zero (0) and one (1). A
criticality of 0 (the default) indicates that keepalive
should respawn only this process/daemon if it
terminates. A value of 1 indicates that keepalive should
respawn the entire group if this process/daemon terminates.
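A group configuration file in the format above could be read as in the following sketch. It is a hypothetical reader, not the parser used by spawndaemon; it assumes that blank lines may be skipped and that trailing colons may be omitted when wait_time and critical are not given. The group and member names in the usage note are invented.

```python
from dataclasses import dataclass


@dataclass
class GroupMember:
    member_file: str
    wait_time: int = 0   # seconds to delay before starting the next member
    critical: int = 0    # 1 means respawn the entire group if this member dies


def parse_group_config(text):
    """Return (group_name, [GroupMember, ...]) or raise ValueError."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty group configuration")
    tag, sep, group_name = lines[0].partition(":")
    if tag != "<keepalive_group>" or not sep or not group_name:
        raise ValueError("first line must be <keepalive_group>:group_name")
    if len(group_name) > 16:
        raise ValueError("group_name cannot exceed 16 characters")
    members = []
    for line in lines[1:]:
        parts = line.split(":")
        if len(parts) > 3 or not parts[0]:
            raise ValueError("bad member line: %r" % line)
        parts += [""] * (3 - len(parts))
        member_file, wait, critical = parts
        wait_time = int(wait) if wait else 0
        if wait_time < 0:
            raise ValueError("wait_time must be a non-negative integer")
        crit = int(critical) if critical else 0
        if crit not in (0, 1):
            raise ValueError("critical must be 0 or 1")
        members.append(GroupMember(member_file, wait_time, crit))
    return group_name, members
```

Given a file containing the lines "<keepalive_group>:webgrp", "ka_httpd:5:1", and "ka_logger::", the function returns the group name webgrp and two members, the first critical with a 5-second delay and the second taking both defaults.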
Examples
The content of the process configuration file for
cron(1M)
found at /etc/spawndaemon.d/ka_cron is as follows:
:/usr/sbin/cron:::root:sys::::cron_startup::cron_restart:::
This line indicates that cron:
- Is not part of a group,
- Has a full pathname of /usr/sbin/cron,
- Uses root for the uid and sys for the gid,
- Has both a startup_script and a process_failure_recovery_script,
- Has no argument list, shutdown_script,
node_failure_recovery_script, or down_script,
- Takes the default values for termwait, max_errors,
probation_period, and minrespawn.
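The same interpretation can be reproduced mechanically. The sketch below splits a process configuration line into its 15 colon-separated fields and fills in the documented defaults; it is an illustration rather than the parser used by spawndaemon, and it assumes that no field value (including arg_list) contains a colon. The function name is hypothetical.

```python
FIELDS = [
    "group_name", "full_path_to_executable", "arg_list", "termwait",
    "uid", "gid", "max_errors", "probation_period", "minrespawn",
    "startup_script", "shutdown_script", "process_failure_recovery_script",
    "node_failure_recovery_script", "down_script", "down_script_policy",
]

# Defaults documented for the optional numeric fields.
NUMERIC_DEFAULTS = {
    "termwait": 2,
    "max_errors": 10,
    "probation_period": 300,
    "minrespawn": 0,
    "down_script_policy": 1,
}


def parse_process_config(line):
    """Return a dict of the 15 fields, with numeric fields converted to int."""
    values = line.rstrip("\n").split(":")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, found %d" % (len(FIELDS), len(values)))
    record = dict(zip(FIELDS, values))
    for name, default in NUMERIC_DEFAULTS.items():
        record[name] = int(record[name]) if record[name] else default
    return record


cron = parse_process_config(
    ":/usr/sbin/cron:::root:sys::::cron_startup::cron_restart:::")
```

Here cron["group_name"] is the empty string, cron["startup_script"] is cron_startup, and cron["process_failure_recovery_script"] is cron_restart; the numeric fields take their defaults.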
For examples of startup scripts and other types of scripts, refer to the
files located in /etc/keepalive.d on your cluster.
Process/Daemon States
Processes/daemons monitored by keepalive exist in one of the following states
at all times:
- start
- The process/daemon is being started and goes to the ok state if the
start is successful. If the start fails on all available members of the
node set, the process/daemon goes to the
dead state. If no members of the node set are available,
the process/daemon remains in the start state.
- ok
- The process/daemon is running. A process/daemon in this state goes
to the daemonize state if it has failed and has been registered
without the -o option, to dead if it has failed
and has been registered with the -o option or has rejected
the node, to shutdown if spawndaemon is used
to shut down the process/daemon or if a critical group member has failed
and it is a member of that group, or to down if it is a member of
a group that includes a process/daemon that has gone to down or if
it exits with a down exit code and the down feature (see -c down option)
has been enabled.
- dead
- The process/daemon has failed and is not running. It goes to the
respawn state if
its max_errors have not been exceeded within the probation_period;
if max_errors have been exceeded, it goes to down.
- down
- The process/daemon is not running. It has exceeded its max_errors
within the probation_period or is part of a group that includes
a member that has gone down for exceeding its max_errors.
The -x option is used to go to the respawn state.
- respawn
- The process/daemon is being restarted. It goes to the ok state
if the restart is successful or to dead if not successful on any of
the available nodes in the node set. If no members of the node set
are available, the process/daemon remains in the respawn state.
- shutdown
- The process/daemon is not running and has been shut down from the
ok state (see -k option). The process/daemon either goes to the
respawn state (if -x used with -k) or is
unregistered (see -d, -p, -s, and -Q options).
- daemonize
- The process/daemon has failed and may have daemonized itself,
so keepalive runs the
daemonization recovery algorithm. If daemonization recovery succeeds,
the process/daemon goes to the ok state; if it fails, the process/daemon
goes to the dead state.
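The state descriptions above can be condensed into a transition table. In the sketch below the state names come from this section, but the event labels are hypothetical names invented to stand for the conditions described; unregistration from the shutdown state is not modeled.

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("start", "succeeded"): "ok",
    ("start", "failed_on_all_nodes"): "dead",
    ("start", "no_node_available"): "start",
    ("ok", "failed"): "daemonize",             # registered without -o
    ("ok", "failed_with_o"): "dead",           # registered with -o
    ("ok", "rejected_node"): "dead",
    ("ok", "shutdown_requested"): "shutdown",
    ("ok", "critical_member_failed"): "shutdown",
    ("ok", "group_member_down"): "down",
    ("ok", "down_exit_code"): "down",
    ("dead", "errors_within_limit"): "respawn",
    ("dead", "errors_exceeded"): "down",
    ("down", "x_option"): "respawn",
    ("respawn", "succeeded"): "ok",
    ("respawn", "failed_on_all_nodes"): "dead",
    ("respawn", "no_node_available"): "respawn",
    ("shutdown", "x_option"): "respawn",       # -x used with -k
    ("daemonize", "recovery_succeeded"): "ok",
    ("daemonize", "recovery_failed"): "dead",
}


def next_state(state, event):
    """Return the state that follows `state` when `event` occurs."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError("no transition from %r on %r" % (state, event))
```

Tracing a failure of a process/daemon registered without -o gives ok, daemonize, and then either ok again or dead, depending on whether daemonization recovery succeeds.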
References
load_leveld(1),
migrate(1),
node_self(1),
onnode(1),
cluster(1M),
init(1M),
keepalive(1M),
syslogd(1M),
setuid(2),
setgid(2),
signal(3bsd),
syslog(3G)
15 August 2001
Copyright 2001 Compaq Computer Corporation
Cluster-Tools Version 0.5.8