Cluster Infrastructure for Linux (CI)

Overview

This project is developing a common infrastructure for Linux clustering by extending the CLuster Membership Subsystem ("CLMS") and Internode Communication Subsystem ("ICS") of the OpenSSI project.

The Sourceforge.net project summary page is located here.

License [Top]

Both the SSI and CI code are being released under the GNU General Public License (GPL), version 2. This is the same license used by the Linux kernel.

Download [Top]

CVS [Top]

Contributed Code [Top]

You can find contributed patches and related code here. See the mailing list archives for context.

Mailing Lists [Top]

Goals [Top]

  1. Provide a cluster infrastructure that can be used as the basis for many different cluster products and projects, including HA failover, load leveling, parallel filesystem, HPC and Single System Image (SSI).
  2. Provide a membership service functional enough for each of these cluster environments, an easy way to add subsystems to the environment, and a set of APIs for building cluster subsystems and cluster-aware applications.
  3. Modularity, so that alternative implementations of nodedown detection, I/O fencing, split-brain detection and internode communication can be substituted.
  4. Flexibility, so the infrastructure can be tuned for different performance constraints and/or membership policy choices. Also flexibility in when the cluster is formed (SSI needs it formed at boot time, while more loosely-coupled clusters want the ability to join and leave at more arbitrary times).
  5. Minimal kernel hooks that do not impact the performance of normal operations.

Project List [Top]

  Project                                                        Assigned to
  ---------------------------------------------------------------------------
  CI projects with available versions
    Membership                                                   John Byrne
    Internode Communication                                      John Byrne
    Integration of CI with DLM                                   Aneesh Kumar
    Integration of CI with UML                                   Krishna Kumar
    Infiniband interconnect                                      Stan Smith
    Dual path interconnect with transparent failover             Jaideep, En Chiang
  Ongoing CI projects
    Enforce membership via node numbers and/or IP/MAC addresses  Jaideep
    Integration with STOMITH                                     Jaideep
    Quorum integration                                           Jaideep
    Split brain detection/avoidance                              Jaideep, Roger Tsang
    Convert code to be dynamically added modules                 Laura Ramirez
  CI projects not yet started
    Membership
    Internode Communication
    Clusters across subnets                                      open
    Other interconnects                                          open
    Integration of CI with Heartbeat                             open
    Integration of CI with DRBD                                  open
    Integration of CI with Beowulf                               open
    Integration of CI with EVMS                                  open

Features [Top]

There are two major components to the Cluster Infrastructure at this time - Cluster Membership (CLMS) and Internode Communication Subsystem (ICS):

Cluster Membership (CLMS) [Top]

  1. Configurable to be initiated before/after root mounting.
  2. Co-ordinated with ICS to set up connections in the kernel before/after initialization of the TCP/IP stack.
  3. Handles initial cluster formation and adding/losing nodes at later times
  4. Online adding of new nodes
  5. Strict maintenance of membership, even in the face of arbitrary node failures
  6. Set of membership APIs (libcluster), including a membership transition history that is guaranteed to move forward and is synchronized/replicated among all nodes, even those that boot later
  7. Coordination of cleanup/teardown when a node goes down
  8. Ensures a node doesn't rejoin before everyone has finished all cleanup processing
  9. Master-driven membership algorithm with rapid failover
  10. Set of known nodes (known as candidates) which can become the CLMS master (LILO extensions for now)
  11. Nodedown detection and API driven membership
  12. Architected to allow a largely separate kernel nodedown detection technology
  13. Architected to allow policy hooks for nodedown and nodeup decisions
  14. Can deliver SIGCLUSTER to processes that request it on any cluster membership transition (see the sketch after this list)
  15. Optionally manages a node through a set of states, from booting to the appropriate run level to shutdown or failure
  16. Optionally co-ordinates with init to bring the cluster as a whole to and through the designated run levels
  17. Optional registration system so kernel subsystems can be called for nodeup and nodedown events
  18. Optional key service management system (more on key services later)
  19. Could be easily adapted for loosely coupled clusters (non-SSI, non-shared root HA clusters)
  20. Has run in clusters of up to 30 nodes and is designed to accommodate much larger clusters
  21. Running in a 2.4.x Linux kernel with minimal patches
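
To make item 14 concrete, here is a minimal user-space sketch of a process that asks to be told about membership transitions. It assumes only what the list above states: SIGCLUSTER is defined by the CI kernel patch and is ignored by default, so an interested process must install a handler. The libcluster or ssisys query you would make after the notification is left as a comment, since the exact API is not reproduced here.

    /*
     * Minimal sketch (not taken from the CI sources): catch SIGCLUSTER,
     * which the CI kernel patch defines and delivers on membership
     * transitions, and re-examine the membership from the main loop.
     */
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t transition_seen;

    static void on_sigcluster(int sig)
    {
        (void)sig;
        transition_seen = 1;            /* defer real work to the main loop */
    }

    int main(void)
    {
        struct sigaction sa = { 0 };

        sa.sa_handler = on_sigcluster;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGCLUSTER, &sa, NULL);   /* ignored by default, so opt in */

        for (;;) {
            pause();                        /* woken by SIGCLUSTER */
            if (transition_seen) {
                transition_seen = 0;
                /* Re-query the membership here, e.g. through libcluster
                 * or the ssisys system call mentioned in the hook list. */
                printf("cluster membership transition observed\n");
            }
        }
    }
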
Internode Communication Subsystem (ICS) [Top]

  1. Architected as a kernel-to-kernel communication subsystem
  2. Designed to be able to start up connections before/after initialization of the TCP/IP stack.
  3. Could be used in more loosely coupled cluster environments
  4. Works with CLMS to form a tightly coupled (membership-wise) environment where all nodes agree on the membership list and have communication with all other nodes
  5. There is a set of communication channels between each pair of nodes; flow control is per channel
  6. Supports variable message size (at least 64K messages)
  7. Queueing of outgoing messages
  8. Dynamic service pool of kernel server processes
  9. Out-of-line data type for large chunks of data and transports that support pull or push DMA
  10. Priority of messages to avoid deadlock
  11. Incoming message queueing
  12. Nodedown interfaces and co-ordination with CLMS and subsystems
  13. Nodedown code to error out outgoing messages, flush incoming messages and kill/wait for server processes handling messages from the node that went down
  14. Architected with transport-independent and transport-dependent pieces (has run with TCP/IP and ServerNet)
  15. Supports 3 communication paradigms (illustrated after this list):
    • one-way messages
    • traditional RPCs, where the client waits synchronously for the response
    • request/response or async RPC, where the requestor can choose when to wait for the response
  16. Very simple generation language (ICSgen)
  17. Works with XDR/RPCgen
  18. Handles signal forwarding from a client node to a node providing service, to allow interruption or job control
  19. Operational in a Linux 2.4.x kernel with minimal patches
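
The three paradigms in item 15 differ mainly in how the caller waits for a reply. The prototypes below illustrate that shape only; the names and signatures are invented for this sketch and are not the real ICS entry points (which are generated per service by ICSgen), so treat the block as an assumption about the calling conventions rather than the actual interface.

    /*
     * Hypothetical prototypes illustrating the three ICS communication
     * paradigms listed above.  Names and signatures are invented for the
     * sketch; the real kernel-to-kernel stubs come out of ICSgen.
     */
    typedef int icsnode_t;          /* cluster node number (assumed type)     */
    struct ics_msg;                 /* opaque message; large payloads travel  */
                                    /* as out-of-line data on DMA transports  */
    typedef void *ics_handle_t;     /* token identifying an in-flight RPC     */

    /* 1. One-way message: fire and forget, no reply expected. */
    int ics_send_oneway(icsnode_t node, struct ics_msg *msg);

    /* 2. Traditional RPC: the caller blocks until the response arrives. */
    int ics_rpc(icsnode_t node, struct ics_msg *req, struct ics_msg *resp);

    /* 3. Async RPC: send now, do unrelated work, collect the response later. */
    int ics_rpc_send(icsnode_t node, struct ics_msg *req, ics_handle_t *handle);
    int ics_rpc_wait(ics_handle_t handle, struct ics_msg *resp);

In the actual subsystem, per-channel flow control, message priorities and the nodedown error-out behaviour described above all sit underneath whichever of these paradigms a service uses.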

Current Status [Top]

  1. The Cluster Infrastructure (CI) download is completely operational on 32-bit Intel machines (including SMP).
  2. We have done limited testing of clusters of up to 10 nodes, with node numbers up to 50
  3. The nodedown detection code is currently primitive and takes up to 20 seconds to detect a nodedown. Soon we expect to improve this to subsecond detection. Nodedown cleanup, reconciliation and notification after detection is measured at <100 milliseconds for 3-node clusters (larger clusters shouldn't take much longer but haven't been measured).
  4. ICS channel flow control / throttling isn't working yet; we still need to determine the appropriate low-memory conditions that should trigger it.

Kernel Hooks [Top]

  1. Makefile
    • Include cluster directory in make. (#ifdef CONFIG_CLUSTER)
  2. Documentation/Configure.help
    • Documentation for Cluster features.
  3. arch/i386/config.in
    arch/alpha/config.in
    • Add Cluster features to config menu.
  4. arch/i386/defconfig
    arch/alpha/defconfig
    • Turn Clustering features on by default.
  5. arch/i386/kernel/entry.S
    arch/alpha/kernel/entry.S
    • Add sys_ssisys to system call jump table. (should rename this or use /proc to get/receive information)
  6. include/linux/rwsem.h
    • Add arch-independent read/write trylock routines.
  7. include/asm-i386/rwsem.h
    • Add arch-specific read/write trylock implementations.
  8. include/linux/rwsem-spinlock.h
    • Add FASTCALLs for read/write trylocks.
  9. lib/rwsem-spinlock.c
    • Add library implementations of read/write trylocks.
  10. include/asm-i386/unistd.h
    include/asm-alpha/unistd.h
    • #define ssisys system call number. (should rename this or use /proc to get/receive information)
  11. arch/i386/kernel/signal.c
    arch/alpha/kernel/signal.c
    • Ignore SIGCLUSTER by default.
  12. include/asm-i386/signal.h
    include/asm-alpha/signal.h
    • #define SIGMIGRATE and SIGCLUSTER signals (SIGMIGRATE actually for process migration and not needed for this project)
  13. kernel/signal.c
    • Ignore SIGCLUSTER by default.
    • Add routine do_sigtoallproc_local() to signal all local processes. Used for SIGCLUSTER notification after a membership event (#ifdef CONFIG_CLUSTER)
  14. kernel/timer.c
    • In count_active_tasks(), don't count a task if its is_kthread flag is set and the task is in TASK_UNINTERRUPTIBLE sleep. (#ifdef CONFIG_CLUSTER) (should be able to do this another way and eliminate this hook)
  15. kernel/fork.c
    • In get_pid(), allocate clusterwide pid numbers. (#ifdef CPID)
  16. include/linux/threads.h
    • Increase PID_MAX for clusterwide pids. (#ifdef CPID)
  17. include/linux/sched.h
    • Include icsprio member in task_struct. (#ifdef CONFIG_ICS)
    • Include is_kthread flag in task_struct. (#ifdef CONFIG_CLUSTER) (should be able to eliminate these hooks)
  18. init/main.c
    • Add hooks to process cluster parameters from command line. (#ifdef CONFIG_CLUSTER)
    • Add hooks to call cluster_main_init_preroot / cluster_main_init_postroot (see the sketch after this list). (#ifdef CONFIG_CLUSTER)
  19. include/linux/if.h
    • Define IFF_ICS interface flag for ICS interfaces. (#ifdef CONFIG_ICS)
  20. include/linux/netdevice.h
    • Add ics_flags member to net_device structure to hold ICS-related interface flags. (#ifdef CONFIG_ICS)
  21. net/core/dev.c
    • Prevent shutdown of interfaces with IFF_ICS flag set.
  22. net/ipv4/tcp.c
    • Add ics_recvmsg() routine. (#ifdef CONFIG_ICS)
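
Most of the hooks above follow the same pattern: a few lines guarded by #ifdef CONFIG_CLUSTER (or CONFIG_ICS / CPID) dropped into existing kernel code. Below is a rough sketch of what item 18 looks like, using the function names given in that item; the exact placement and arguments in the real init/main.c patch may differ.

    /* Sketch only, not the literal patch hunk from init/main.c. */
    #ifdef CONFIG_CLUSTER
        cluster_main_init_preroot();    /* CLMS/ICS work needed before the
                                           root filesystem is mounted      */
    #endif

        /* ... existing kernel initialization, root mount, etc. ... */

    #ifdef CONFIG_CLUSTER
        cluster_main_init_postroot();   /* pieces that need root and the
                                           TCP/IP stack to be up           */
    #endif
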

Known Limitations / Enhancement Areas [Top]

  1. Reconciliation and integration with the STOMITH and split-brain code is not done
  2. Kernel hooks could be reduced.
  3. No clean way for a node to leave the cluster (besides a reboot).
  4. There is no configuration information on who the allowed cluster members are, just who the possible Membership Masters are.
  5. Support for overlapping clusters has been requested.
  6. Support for hierarchical clusters has been requested, where the impact might be that non-core nodes wouldn't be monitored for nodedown nearly as frequently as core nodes containing resources that have to be quickly recovered.
  7. Have libcluster use a /proc interface instead of, or in addition to, the system call (sketched below).
  8. Clean up SMP locking to not go through an intermediate set of macros.
  9. Layer some of the other cluster projects, such as DLM, LVS, FailSafe, GFS and SSI, on top of CI.
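
For item 7, reading membership out of /proc would let libcluster avoid the ssisys system call. Nothing like this exists in the current code; the path and line format below are invented purely to show the shape of the proposed enhancement.

    /* Hypothetical /proc reader for enhancement 7 above; the file
     * /proc/cluster/members and its format are assumptions, not real. */
    #include <stdio.h>

    int main(void)
    {
        char line[128];
        FILE *f = fopen("/proc/cluster/members", "r");

        if (!f) {
            perror("/proc/cluster/members");
            return 1;
        }
        while (fgets(line, sizeof line, f))
            fputs(line, stdout);    /* e.g. one "node <num> <state>" per line */
        fclose(f);
        return 0;
    }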

This file last updated on Friday, 21-Sep-2007 16:17:22 UTC