4. HFS Ups and Downs
The Hierarchical File Server Service (HFS), which provides the OUCS Backup and Archive services, recently suffered a dramatic break in service. On 17th March, the HFS team discovered corruption in one of the TSM databases. Then, at about 7:30pm, in an independent incident, the system suffered a major fault in the Disk Server that hosts all the TSM databases. As a result, all HFS services were brought down suddenly. A Severity 1 support call was opened with IBM.
Initial diagnostics showed that both controllers in the Disk Server had crashed, leaving the data in an unknown state. A decision was therefore taken to restore all TSM server databases from tape backups. The initial hurdle was to find enough spare disk to host the restored databases, which after RAID mirroring require more than 1TB per copy. With a considerable amount of hard work by the HFS team, this was done: the databases were restored and all but one were rolled forward to a point just prior to the crash. At the same time, the database that had suffered corruption was also restored.
All services were back on-line by lunchtime on 19th March. A small number of people had manually backed up their systems between the last good copy of the database and its failure; they were contacted and asked to resend their last backup. We believe that, as a result, no user backups were lost. This incident occurred even though the Disk Servers are designed with complete redundancy, such that any one component can fail without affecting data access. The primary cause was a version of device code that effectively brought down both controllers. This will now be addressed by designing resilience around this possibility.
Two more HFS/OUCS Registration tools have been created that ITSS may find useful. You can set a new owner for an account whose owner has left (via ‘TSM Clients - no owner’), and you can register an account with another user as owner. Help on using the HFS.