Concepts
Architecture
The General Storage Cluster Controller (GSCC) is a high-availability solution designed specifically for the IBM Spectrum Protect (ISP, formerly IBM Tivoli Storage Manager, TSM) server. GSCC is not a conventional server cluster but an application cluster tailored to the requirements of ISP servers. Because the solution focuses on the application, it can monitor and, where necessary, resolve not only hardware failures but also functional errors and logical problems of the application.
GSCC manages individual ISP server instances as independent objects. Any of these instances can be prevented from fulfilling its actual tasks by a number of different errors. GSCC analyzes these errors and fixes them in a targeted way with proportionate means. In contrast to traditional clustering solutions, the application is not simply moved to another system if the error can be solved locally or if the move would not solve the problem. GSCC monitors all ISP instances and continuously checks their functionality and associated resources (storage pools, LAN connection, etc.). If an error state cannot be resolved locally, GSCC initiates a takeover by another system. If the error persists there as well, more drastic corrections may be required. GSCC can even respond to errors in the ISP database and fall back to a consistent standby copy of the database if necessary.
These additional steps and functions can also be performed manually and precisely with GSCC. ISP server instances can be started, stopped and moved via a GUI or the command line in order to perform configuration changes or hardware or software maintenance, or even for load-balancing reasons. Storage agent and ISP client environments are transparently serviced by the cluster.
GSCC is a very flexible and configurable cluster solution. It supports several takeover scenarios by using two main takeover mechanisms, which can be combined or prioritized differently:
Shared disk failover
Volume Manager Failover like LVM
Shared Filesystem like GPFS
Standby database failover
TSM Classic Sync
DB2 HADR (High Availability Disaster Recovery) log shipping
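The escalation described above (local repair first, then takeover by another system, then fallback to the standby database) can be pictured as an ordered list of recovery steps that stops at the first success. The following is a minimal sketch only: the step names and the `recover` function are hypothetical and are not part of GSCC.

```python
from typing import Callable, List, Tuple

# Hypothetical recovery steps, ordered from least to most disruptive.
# Each action returns True when the instance is healthy afterwards.
Step = Tuple[str, Callable[[], bool]]


def recover(steps: List[Step]) -> List[str]:
    """Run the steps in order and stop at the first one that succeeds.

    Returns the names of the steps that were attempted.
    """
    attempted = []
    for name, action in steps:
        attempted.append(name)
        if action():
            break
    return attempted


if __name__ == "__main__":
    # Illustrative outcome only: local repair fails, takeover succeeds.
    steps = [
        ("local_repair", lambda: False),
        ("takeover_by_peer", lambda: True),
        ("activate_standby_database", lambda: True),
    ]
    print(recover(steps))
```

The point of the ordering is that the more drastic steps are only reached when the cheaper ones have not resolved the error.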
The shared disk failover switches a production database from one UNIX system to another by accessing the original disk volumes that contain the TSM configuration, database and logs. These volumes could be SAN disks configured with AIX LVM, Veritas VM, SVM, ZFS or GPFS.
The standby failover refers to the activation of a second, independent database copy that shares no database resources with the production ISP instance. Two replication mechanisms can be used to create this database copy: ISP replication by restore and, since TSM v6, the DB2 function HADR (High Availability Disaster Recovery). While replication by restore produces the most asynchronous copy of the database, HADR allows the synchronization level to be set anywhere from asynchronous to fully synchronous. ISP replication by restore is referred to as classic sync and is a legacy function mostly used with TSM v5; since TSM v6, HADR is the preferred replication method. Based on a synchronous HADR standby copy, GSCC supports HADR-only clusters with no shared database disk resources (IP cluster).
The different failover methods each have advantages and disadvantages, and the choice depends on the requirements. However, the way GSCC reacts can easily be changed by using different “rulesets” defined in a so-called expert domain. GSCC takes care of the TSM resources, including the volume management tasks in a shared disk scenario and the HADR tasks in a standby failover situation. Most importantly, GSCC also takes care of the TSM server process itself and the IP addresses used by TSM clients and storage agents.
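To make the range "from asynchronous to fully synchronous" concrete: DB2 exposes the HADR synchronization level through the `HADR_SYNCMODE` database configuration parameter. The sketch below lists its values and builds the corresponding DB2 command line as a string. The helper name is hypothetical and the database name `TSMDB1` is an assumption. The text does not say how GSCC itself sets this parameter, so this is shown for illustration only.

```python
from enum import Enum


class HadrSyncMode(Enum):
    # Values of the DB2 HADR_SYNCMODE database configuration parameter,
    # listed from most to least synchronous.
    SYNC = "SYNC"
    NEARSYNC = "NEARSYNC"
    ASYNC = "ASYNC"
    SUPERASYNC = "SUPERASYNC"


def syncmode_command(database: str, mode: HadrSyncMode) -> str:
    """Build the DB2 command-line string that sets the HADR sync mode."""
    return f"db2 update db cfg for {database} using HADR_SYNCMODE {mode.value}"


if __name__ == "__main__":
    # TSMDB1 is assumed here as the database name.
    print(syncmode_command("TSMDB1", HadrSyncMode.NEARSYNC))
```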
Components
SPORD
Daemons
ISP Layout
DB2 HADR
State Model
Expert Domain
The expert domain is the entity that configures the behavior of GSCC. An expert domain consists of defined states and rules. It can be changed easily, provided the TSM requirements for the new behavior are met. There are four main types of expert domains for managing TSM instances. They differ mainly in the way the failover occurs. As explained before, GSCC supports HADR configurations and different volume manager configurations. This means GSCC can use database functions, volume manager functions, or both to take over a TSM instance and make it available again. Which alternative is better depends on the environment and the requirements. GSCC therefore offers all possible combinations, which the expert domain architecture makes possible. The four main alternatives are shown in the following chart.
The upper configurations use a member team with 4 members and provide a combination of shared disk failover based on volume management and HADR. The prerequisite is that the database resources can be activated independently on both hosts. Both takeover paths can be performed manually; the main difference between the two expert domains is the primary failover decision in an uncontrolled situation. An uncontrolled situation is detected when a complete host fails or is unreachable. In such a situation, expert domain 1 would first execute the HADR failover. This is useful when a shared disk failover is considered more likely to fail because of reservations or limitations on the shared disk subsystem. Expert domain 2, in contrast, would first fail over the primary database by using the shared disk resources. However, even here the standby can be used in a second step to activate TSM.
The two lower expert domains are limited to a single takeover alternative. The HADR, IP-only alternative can only fail over with HADR methods. It is typically used in a non-SAN setup where even the TSM resources are on local disks. The fourth expert domain uses only shared disks to fail over to the other host; HADR is not configured at all. This variant suits environments where the additional complexity is not needed and where, in case of major issues, a restore can be performed quickly because the database is small.
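The four alternatives can be summarized as an ordered list of failover mechanisms per expert domain, where the first entry is the primary decision in an uncontrolled situation. This is a minimal sketch with hypothetical names, not GSCC's configuration format. The second entry for expert domain 1 (shared disk after HADR) is an assumption, since the text only states its first step explicitly.

```python
from dataclasses import dataclass
from typing import Tuple

SHARED_DISK = "shared_disk"
HADR = "hadr"


@dataclass(frozen=True)
class ExpertDomain:
    name: str
    # Mechanisms in the order they are tried in an uncontrolled situation.
    failover_order: Tuple[str, ...]


# Hypothetical names; the ordering follows the four alternatives in the text.
DOMAINS = (
    ExpertDomain("domain_1_hadr_first", (HADR, SHARED_DISK)),  # 2nd step assumed
    ExpertDomain("domain_2_shared_disk_first", (SHARED_DISK, HADR)),
    ExpertDomain("domain_3_hadr_ip_only", (HADR,)),
    ExpertDomain("domain_4_shared_disk_only", (SHARED_DISK,)),
)


def primary_failover(domain: ExpertDomain) -> str:
    """Return the mechanism tried first when a host fails or is unreachable."""
    return domain.failover_order[0]


if __name__ == "__main__":
    for domain in DOMAINS:
        print(domain.name, "->", primary_failover(domain))
```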
The GSCC cluster supports different failover scenarios. The following two main mechanisms are used; each can be configured on its own or in combination with the other.
Shared disk failover
Volume Manager Failover like LVM
Shared Filesystem like GPFS
Standby database failover
TSM Classic Sync
DB2 HADR (High Availability Disaster Recovery) log shipping
Because GSCC is based on a state model, its behavior can be changed relatively easily. The required rules are defined in rule sets. These rule sets are then combined in a so-called “Expert Domain”. There are tested and verified “Expert Domains” for the different failover scenarios. It is recommended to use these predefined configurations when using GSCC. However, it is possible to adjust certain detailed steps to meet different requirements.
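The relationship between states, rules, rule sets and an expert domain can be illustrated with a small table-driven state machine. This is a sketch under stated assumptions: the state names, events, actions and the merge semantics are all hypothetical and do not reflect GSCC's actual rule syntax.

```python
from typing import Dict, Tuple

Rule = Tuple[str, str]                 # (next_state, action)
RuleSet = Dict[Tuple[str, str], Rule]  # (state, event) -> rule

# Hypothetical rule sets; real GSCC rule sets are defined differently.
base_rules: RuleSet = {
    ("running", "server_down"): ("restarting", "restart_locally"),
    ("restarting", "restart_ok"): ("running", "none"),
}
failover_rules: RuleSet = {
    ("restarting", "restart_failed"): ("failing_over", "takeover_hadr"),
    ("failing_over", "takeover_ok"): ("running", "none"),
}


def build_expert_domain(*rule_sets: RuleSet) -> RuleSet:
    """Combine rule sets; later sets override earlier ones on conflict."""
    domain: RuleSet = {}
    for rule_set in rule_sets:
        domain.update(rule_set)
    return domain


def step(domain: RuleSet, state: str, event: str) -> Rule:
    """Return (next_state, action); unknown events leave the state unchanged."""
    return domain.get((state, event), (state, "none"))


if __name__ == "__main__":
    domain = build_expert_domain(base_rules, failover_rules)
    print(step(domain, "running", "server_down"))
```

Swapping `failover_rules` for a different rule set changes the reaction to a failed restart without touching the rest of the model, which is the kind of flexibility the state model is meant to provide.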