4. HFS Ups and Downs


The Hierarchical File Server Service (HFS), which provides the OUCS Backup and Archive services, experienced a dramatic break in service recently. On 17th March, the HFS team discovered corruption in a TSM database. Then at about 7:30pm, in an independent incident, the system suffered a major fault with the Disk Server that hosts all the TSM databases. As a result, all HFS services were brought down suddenly. A Severity 1 support call was opened with IBM.

Initial diagnostics showed that both controllers in the Disk Server had crashed, leaving the data in an unknown state. A decision was therefore taken to restore all TSM server databases from tape backups. The initial hurdle was to find enough spare disk to host the restored databases, which after RAID mirroring require more than 1TB per copy. With a considerable amount of hard work by the HFS team, this was done, and the databases were restored and all but one rolled forward to a point just prior to the crash. At the same time, the database suffering corruption was also restored.

All services were back online by lunchtime on 19th March. A small number of people had manually backed up their systems between the last good copy of the database and its failure; they were contacted and asked to resend their last backup. We believe that, as a result, no user backups were lost. This incident occurred in spite of a design of complete redundancy in the Disk Servers, such that any one component can fail without affecting data access. The primary cause was a version of device code that effectively brought down both controllers. This will now be addressed by designing resilience around this possibility.

The HFS team would like to thank all those who offered help and support during this time, and to thank our users for their patience while we strove to return the service to operation.

More on the HFS services

4.1. HFS Tools for IT Support Staff

Two more HFS/OUCS Registration tools that ITSS may find useful have been created. You can set a new owner for an account whose owner has left (via ‘TSM Clients - no owner’) and you can register an account with another user as owner. Help on using the HFS.
