This project is developing a common infrastructure for
Linux clustering by extending the Cluster Membership Subsystem
(CLMS) and Internode Communication Subsystem (ICS) of the
OpenSSI project.
The SourceForge.net project summary page is located
here.
Both the SSI and CI code are being released under the
GNU General Public
License (GPL), version 2. This is the same license used by the
Linux kernel.
- Get Cluster Infrastructure 0.9.9 for
Red Hat 9
- Run a Virtual CI Cluster with UML
and the 2.6.8.1 Kernel
- To check out CI, first log in to the CVS server. Press Enter when
prompted for a password:
$ cvs -d:pserver:anonymous@ci-linux.cvs.sourceforge.net:/cvsroot/ci-linux login
- Many developers work with a Fedora Core 1 version of the
kernel.
You can access it by checking out the OPENSSI-RH branch
of the repository:
$ cvs -z3 -d:pserver:anonymous@ci-linux.cvs.sourceforge.net:/cvsroot/ci-linux co -r OPENSSI-RH ci
- The future of the CI kernel is on 2.6. You can access it by
checking out the trunk of the repository:
$ cvs -z3 -d:pserver:anonymous@ci-linux.cvs.sourceforge.net:/cvsroot/ci-linux co ci
- SourceForge has provided some nice documentation about their CVS services.
- Sign up to receive checkin messages.
You can find contributed patches and such here.
See the mailing list archive for context.
- Discussion list for developers and users
(ci-linux-devel)
- Notification of CVS checkins
(ci-linux-checkins)
- Provide a cluster infrastructure that can be used as the
basis for many different cluster products and projects,
including HA failover, load leveling, parallel filesystems,
HPC and Single System Image (SSI).
- Provide a membership service functional enough for each of
these cluster environments, with an easy way to add subsystems
to the environment and a set of APIs for building cluster
subsystems and cluster-aware applications.
- Modularity, so other implementations of nodedown detection, I/O
fencing, split-brain detection and internode communication can
be substituted.
- Flexibility, so the infrastructure can be tuned for different
performance constraints and/or membership policy choices.
Also flexibility in when the cluster is formed (SSI needs it
formed at boot time, while more loosely-coupled clusters want the
ability to join and leave at more arbitrary times).
- Minimal kernel hooks that do not impact the performance of
normal operations.
Project                                                       Assigned to
--------------------------------------------------------------------------
CI projects with available versions
  Membership                                                  John Byrne
  Internode Communication                                     John Byrne
  Integration of CI with DLM                                  Aneesh Kumar
  Integration of CI with UML                                  Krishna Kumar
  Infiniband interconnect                                     Stan Smith
  Dual path interconnect with transparent failover            Jaideep, En Chiang

Ongoing CI projects
  Enforce membership via node numbers and/or IP/MAC addresses Jaideep
  Integration with STOMITH                                    Jaideep
  Quorum integration                                          Jaideep
  Split brain detection/avoidance                             Jaideep, Roger Tsang
  Convert code to be dynamically added modules                Laura Ramirez

CI projects not yet started
  Membership
  Internode Communication
    Clusters across subnets                                   open
    Other interconnects                                       open
  Integration of CI with Heartbeat                            open
  Integration of CI with DRBD                                 open
  Integration of CI with Beowulf                              open
  Integration of CI with EVMS                                 open
There are two major components to the Cluster Infrastructure at this
time: Cluster Membership (CLMS) and the Internode Communication
Subsystem (ICS).
Cluster Membership (CLMS)
- Configurable to be initiated before/after root mounting.
- Coordinated with ICS to set up connections in the kernel
before/after initialization of the TCP/IP stack.
- Handles initial cluster formation and adding/losing
nodes at later times
- Online adding of new nodes
- Strict maintenance of membership, even in the face
of arbitrary node failures
- Set of membership APIs (libcluster), including a
membership transition history that is guaranteed to move forward
and is synchronized/replicated among all nodes, even
those that boot later
- Coordination of cleanup/teardown when a node goes down
- Ensures a node doesn't rejoin before everyone has
finished all cleanup processing
- Master-driven membership algorithm with rapid failover
- Set of known nodes (candidates) that can be
CLMS master (LILO extensions for now)
- Nodedown detection and API-driven membership
- Architected to allow a largely separate kernel nodedown
detection technology
- Architected to allow policy hooks for nodedown and nodeup
decisions
- Can deliver SIGCLUSTER to processes that request it on
any cluster membership transition (see the sketch after this list)
- Optionally manages a node through a set of states, from
booting to the appropriate run level to shutdown or failure
- Optionally coordinates with init to bring the cluster
as a whole to and through the designated run levels
- Optional registration system so kernel subsystems can
be called for nodeup and nodedown events
- Optional key service management system (more on key
services later)
- Could be easily adapted for loosely coupled clusters
(non-SSI, non-shared root HA clusters)
- Has run in clusters of up to 30 nodes and is designed to
accommodate much larger clusters
- Running in a 2.4.x Linux kernel with minimal patches
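Since SIGCLUSTER is ignored by default, a process that wants membership
notifications must install a handler. The C sketch below shows the general
shape; the fallback signal number and the cls_get_membership() call named in
the comment are assumptions for illustration, not the actual libcluster
interface.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* SIGCLUSTER is defined by the CI kernel patches (see the kernel
     * hooks list below); the fallback value here is assumed, for
     * illustration only. */
    #ifndef SIGCLUSTER
    #define SIGCLUSTER 33
    #endif

    static volatile sig_atomic_t transition_seen;

    static void on_cluster_transition(int sig)
    {
        (void)sig;
        transition_seen = 1;    /* defer real work to the main loop */
    }

    int main(void)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_cluster_transition;
        sigaction(SIGCLUSTER, &sa, NULL);

        for (;;) {
            pause();
            if (transition_seen) {
                transition_seen = 0;
                /* A real application would call into libcluster here
                 * (e.g., a hypothetical cls_get_membership()) to read
                 * the new membership list and transition history. */
                printf("cluster membership transition observed\n");
            }
        }
    }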
Internode Communication Subsystem (ICS)
- Architected as a kernel-to-kernel communication subsystem
- Designed to be able to start up connections before/after
initialization of the TCP/IP stack.
- Could be used in more loosely coupled cluster environments
- Works with CLMS to form a tightly coupled (membership-wise)
environment where all nodes agree on the membership list and have
communication with all other nodes
- There is a set of communication channels between each pair of nodes;
flow control is per channel
- Supports variable message size (at least 64K messages)
- Queueing of outgoing messages
- Dynamic service pool of kernel server processes
- Out-of-line data type for large chunks of data and transports
that support pull or push DMA
- Priority of messages to avoid deadlock
- Incoming message queueing
- Nodedown interfaces and coordination with CLMS and subsystems
- Nodedown code to error out outgoing messages, flush incoming
messages, and kill/wait for server processes handling messages from
the node that went down
- Architected with transport-independent and transport-dependent pieces
(has run over TCP/IP and ServerNet)
- Supports 3 communication paradigms (illustrated after this list):
- one way messages
- traditional RPCs, where the client must synchronously wait
for the response
- request/response or async RPC, where the requestor can
choose when to wait for the response
- Very simple generation language (ICSgen)
- Works with XDR/RPCgen
- Handles signal forwarding from a client node to a node providing
service, to allow interruption or job control
- Operational in a Linux 2.4.x kernel with minimal patches
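To make the three communication paradigms concrete, here are illustrative
kernel-side call shapes in C. These declarations are assumptions made for
the sake of the example; the real ICS entry points are generated by ICSgen
and their names and signatures differ.

    #include <stddef.h>

    struct ics_handle;  /* opaque per-call handle; assumed for illustration */

    /* 1. One-way message: fire and forget, no reply expected. */
    int ics_send_oneway(int node, int chan, const void *msg, size_t len);

    /* 2. Traditional RPC: the caller blocks until the response arrives. */
    int ics_rpc(int node, int chan, const void *req, size_t reqlen,
                void *resp, size_t resplen);

    /* 3. Request/response (async RPC): send now, then wait for the
     *    response whenever the caller chooses, via the handle. */
    int ics_rpc_async(int node, int chan, const void *req, size_t reqlen,
                      struct ics_handle **hp);
    int ics_rpc_wait(struct ics_handle *h, void *resp, size_t resplen);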
- The Cluster Infrastructure (CI) download is completely
operational on 32-bit Intel machines (including SMP).
- Testing has been limited to clusters of up to 10 nodes,
with node numbers up to 50
- The nodedown detection code is currently primitive and takes
up to 20 seconds to detect a nodedown. We expect to improve
this to subsecond detection soon. Nodedown cleanup, reconciliation
and notification after detection are measured at <100 milliseconds
for 3-node clusters (larger clusters shouldn't take much longer but
haven't been measured).
- ICS channel flow control/throttling isn't working yet; we still
have to determine the appropriate low-memory conditions that should
trigger it.
- Makefile
- Include cluster directory in make. (#ifdef CONFIG_CLUSTER)
- Documentation/Configure.help
- Documentation for Cluster features.
- arch/i386/config.in
arch/alpha/config.in
- Add Cluster features to config menu.
- arch/i386/defconfig
arch/alpha/defconfig
- Turn Clustering features on by default.
- arch/i386/kernel/entry.S
arch/alpha/kernel/entry.S
- Add sys_ssisys to system call jump table. (should
rename this or use /proc to get/receive information)
- include/linux/rwsem.h
- Add arch-independent read/write trylock routines.
- include/asm-i386/rwsem.h
- Add arch-specific read/write trylock implementations.
- include/linux/rwsem-spinlock.h
- Add FASTCALLs for read/write trylocks.
- lib/rwsem-spinlock.c
- Add library implementations of read/write trylocks.
- include/asm-i386/unistd.h
include/asm-alpha/unistd.h
- #define ssisys system call number. (should rename
this or use /proc to get/receive information)
- arch/i386/kernel/signal.c
arch/alpha/kernel/signal.c
- Ignore SIGCLUSTER by default.
- include/asm-i386/signal.h
include/asm-alpha/signal.h
- #define SIGMIGRATE and SIGCLUSTER signals (SIGMIGRATE
is actually for process migration and is not needed for this project)
- kernel/signal.c
- Ignore SIGCLUSTER by default.
- Added routine do_sigtoallproc_local() to signal all
local processes. Used for SIGCLUSTER notification after a
membership event (#ifdef CONFIG_CLUSTER)
- kernel/timer.c
- In count_active_tasks(), don't count a task if its
is_kthread flag is set and the task is in TASK_UNINTERRUPTIBLE
sleep. (#ifdef CONFIG_CLUSTER) (should be able to do this
another way and eliminate this hook; a sketch follows this list)
- kernel/fork.c
- In get_pid(), allocate clusterwide pid numbers.
(#ifdef CPID)
- include/linux/threads.h
- Increase PID_MAX for clusterwide pids. (#ifdef CPID)
- include/linux/sched.h
- Include icsprio member in task_struct. (#ifdef CONFIG_ICS)
- Include is_kthread flag in task_struct. (#ifdef
CONFIG_CLUSTER) (should be able to eliminate these hooks)
- init/main.c
- Add hooks to process cluster parameters from command
line. (#ifdef CONFIG_CLUSTER)
- Add hooks to call cluster_main_init_preroot /
cluster_main_init_postroot. (#ifdef CONFIG_CLUSTER)
- include/linux/if.h
- Define IFF_ICS interface flag for ICS interfaces.
(#ifdef CONFIG_ICS)
- include/linux/netdevice.h
- Add ics_flags member to net_device structure to hold
ICS-related interface flags. (#ifdef CONFIG_ICS)
- net/core/dev.c
- Prevent shutdown of interfaces with IFF_ICS flag set.
- net/ipv4/tcp.c
- Add ics_recvmsg() routine. (#ifdef CONFIG_ICS)
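To give a flavor of how small these hooks are, here is a sketch of the
kernel/timer.c change described above. The base function is paraphrased
from the stock 2.4 kernel; the #ifdef CONFIG_CLUSTER block is a
reconstruction of the described hook, not the exact patch text.

    /* kernel/timer.c (sketch): keep parked cluster kernel threads from
     * inflating the load average. */
    static unsigned long count_active_tasks(void)
    {
        struct task_struct *p;
        unsigned long nr = 0;

        read_lock(&tasklist_lock);
        for_each_task(p) {
    #ifdef CONFIG_CLUSTER
            /* Skip ICS server kernel threads that are merely sleeping
             * uninterruptibly while waiting for work. */
            if (p->is_kthread && p->state == TASK_UNINTERRUPTIBLE)
                continue;
    #endif
            if ((p->state == TASK_RUNNING ||
                 (p->state & TASK_UNINTERRUPTIBLE)))
                nr += FIXED_1;
        }
        read_unlock(&tasklist_lock);
        return nr;
    }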
- Reconciliation and integration with the STOMITH and split-brain code is not done.
- Kernel hooks could be reduced.
- No clean way for a node to leave the cluster (besides a reboot).
- There is no configuration information on who the allowed cluster members are, just who the possible Membership Masters are.
- Support for overlapping clusters has been requested.
- Support for hierarchical clusters has been requested, where the impact might be that non-core nodes wouldn't be monitored for nodedown nearly as frequently as core nodes containing resources that have to be quickly recovered.
- Have libcluster use a /proc interface instead of or in addition to the system call.
- Clean up SMP locking so it does not go through an intermediate set of macros.
- Layer some of the other cluster projects, like DLM, LVS,
FailSafe, GFS and SSI, on top of CI.