7. Recommendations

The new CMR program is intended to deliver a direct increase in both the physical and power capacity available to OUCS. There is a substantial opportunity within the new design to enhance this through data centre and IT efficiency improvements, delivering more effective utilisation of the available physical and energy capacity and therefore a further improvement in the available computing services. This section summarises direct recommendations for the facility in three areas: energy metering, implementation and operational practices, and the management of services within the facilities.

7.1. Measuring operational energy efficiency

One of the benefits of designing and building a green field data centre is the ability to include effective and useful instrumentation within the design at little additional cost.

Before spending additional money on metering equipment it is important to understand what benefits are expected from this instrumentation and the uses to which the data can be put. Many operators have installed expensive products that meter at the IT power strip or socket level in the data centre. Whilst per-socket metering can provide very granular information, this data is not directly useful to OUCS, which does not charge occupants co-location fees based on a multiple of metered power. As indicated in section 6.2.2, it is not reasonable or useful to allocate energy use or cost to a device based only on the metered power at the PSU; further, as technology advances and more devices are virtualised, the notional relationship between a power socket and a logical server or service component is removed. Many IT devices also now report their energy use through management APIs (this is required in the new Energy Star for Computer Servers standard), which avoids the additional problems involved in associating a physical IT device with a numbered socket.
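As an illustration of the management-API approach, the sketch below extracts a device's own power report from a Redfish-style payload. The field names loosely follow the DMTF Redfish Power schema, but the payload and its values are invented for this example, not output from any real server.

```python
import json

# Illustrative Redfish-style Power resource payload; the values are
# invented example data, not a real device reading.
sample_payload = json.dumps({
    "PowerControl": [
        {"Name": "Server Power Control",
         "PowerConsumedWatts": 312,
         "PowerMetrics": {"AverageConsumedWatts": 298,
                          "IntervalInMin": 5}}
    ]
})

def average_power_watts(payload: str) -> float:
    """Extract the interval-average power draw reported by the device itself,
    avoiding any need to map the device to a numbered socket."""
    power_control = json.loads(payload)["PowerControl"][0]
    return float(power_control["PowerMetrics"]["AverageConsumedWatts"])

print(average_power_watts(sample_payload))  # 298.0
```

Polling such reports centrally gives per-device energy data without any physical socket-to-server mapping to maintain.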

Whilst many operators have installed instrumentation to measure power within the data centre, this normally captures the energy used by the IT equipment and not the energy we are concerned with: the energy wasted by non-IT equipment. Where this is measured at all, it is frequently held in Building Management System (BMS) software and not visible to the IT department.

Figure 5: Three levels of data centre energy instrumentation.

If the goal is to understand and manage IT energy use, energy efficiency and cost then a combination of IT and Mechanical and Electrical device consumption metering is required. PDU or rack level metering of the IT load is all that is necessary for OUCS when coupled with effective reporting of the data centre infrastructure loads.

Figure 5 illustrates three levels of data centre energy instrumentation. It is recommended that at least the detailed measurement points (light blue) be instrumented in the OUCS CMR. The meters should be network connected and able to log their data to a central station. Many BMS systems available today are able to accept logging data from remote energy meters, and this data should be made available from the BMS to other software. The historic reporting data should be stored in as granular a form as possible: at least hourly, and ideally at the 300-second IT polling rate.

Data captured from the detailed measurement points will allow the simulation of future energy impacting changes and thus give a much greater understanding and control of energy management within the CMR.
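As a hypothetical illustration of what the detailed measurement points enable, headline figures such as PUE and the non-IT overhead can be derived directly from the metered loads. The meter names and kWh readings below are invented for the example.

```python
# Hypothetical hourly meter readings from the detailed measurement points.
hourly_readings_kwh = {
    "utility_feed": 125.0,  # total facility energy for the hour
    "ups_output":   80.0,   # energy delivered to the IT load
    "chillers":     30.0,
    "crac_fans":    10.0,
}

def pue(readings: dict) -> float:
    """PUE = total facility energy / IT equipment energy."""
    return readings["utility_feed"] / readings["ups_output"]

def overhead_kwh(readings: dict) -> float:
    """Energy consumed by non-IT (M and E) equipment, the quantity
    the report is most concerned with managing."""
    return readings["utility_feed"] - readings["ups_output"]

print(round(pue(hourly_readings_kwh), 2))  # 1.56
print(overhead_kwh(hourly_readings_kwh))   # 45.0 kWh of non-IT overhead
```

Stored at hourly granularity, series like these also allow "what if" simulation of proposed cooling or UPS changes before they are made.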

7.2. Applying the EU Code of Conduct for Data Centres

It is recommended that Oxford University implement the EU Code of Conduct for Data Centre operators. The Code details best practices that apply to M and E infrastructure, IT systems and software. The best practices have been developed and reviewed by industry experts and represent a practical and effective approach to energy-efficient design and ongoing energy management within the data centre.

Romonet recommends that Oxford University implement, at a minimum, the following best practices defined in the EU Code of Conduct for Data Centres. Ideally Oxford University would become a full Participant of the Code and report its energy data back to the EU under the requirements of Participant status. The mandatory reporting of high-level power data will ensure regular (six-monthly) senior visibility of the energy data from the CMR and an increased understanding and appreciation of how the efficiency of the facility is being managed. Regular reviews of the reporting will also give a much better appreciation of the rate of growth in IT service related energy.

7.2.1. Design stage best practices

Listed below are the minimum best practices expected to be implemented during a new-build or major data centre re-fit.

Type | Description | Implementation stage
Group involvement | Establish an approval board containing representatives from all disciplines (software, IT, M and E). Require approval for any significant decision to ensure that the impacts of the decision have been properly understood and an optimal solution reached. For example, this would include the definition of standard IT hardware lists. | Design and Operational
Build resilience to business requirements | Only the level of resilience actually justified by business requirements and impact analysis should be built. 2N infrastructures are frequently unnecessary and inappropriate. Resilience for a small portion of critical services can be obtained using DR / BC sites. | Design
Consider multiple levels of resilience | It is possible to build a single data centre to provide multiple levels of power and cooling resilience to different floor areas. Many co-location providers already deliver this, for example, optional ‘grey’ power feeds without UPS or generator back-up. | Design
Design – Contained hot or cold air | There are a number of design concepts whose basic intent is to contain and separate the cold air from the heated return air on the data floor:
  • Hot aisle containment
  • Cold aisle containment
  • Contained rack supply, room return
  • Room supply, contained rack return
  • Contained rack supply, contained rack return
This action is expected for air-cooled facilities over 1 kW per square metre power density. | Design and Operational
Efficient part load operation | Optimise the facility for the partial load it will experience for most of its operational time rather than the maximum load, e.g. sequence chillers, or operate cooling towers with shared load for increased heat exchange area. | Design
Variable speed fans | Many old CRAC units operate fixed-speed fans which consume substantial power and obstruct attempts to manage the data floor temperature. Variable speed fans are particularly effective where there is a high level of redundancy in the cooling system, low utilisation of the facility or highly variable IT electrical load. | Design
Modular UPS deployment | It is now possible to purchase modular UPS systems across a broad range of power delivery capacities. Physical installation, transformers and cabling are prepared to meet the design electrical load of the facility, but the sources of inefficiency (switching units and batteries) are installed as required in modular units. This substantially reduces both the capital cost and the fixed overhead losses of these systems. In low power environments these may be frames with plug-in modules; in larger environments these are likely to be entire UPS units. | Design
Lean provisioning of power and cooling for a maximum of 18 months of data floor capacity | The provisioning of excess power and cooling capacity in the data centre drives substantial fixed losses and is unnecessary. Planning a data centre for modular expansion and then building out this capacity in a rolling program of deployments is more efficient. This also allows the technology ‘generation’ of the IT equipment and supporting M and E infrastructure to be matched, improving both efficiency and the ability to respond to business requirements. | Design
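The fixed-overhead argument behind modular UPS deployment and lean provisioning can be sketched numerically. The two-term loss model and the loss fractions below are illustrative assumptions for the example, not measured figures for any particular UPS product.

```python
# Illustrative two-term UPS loss model: a fixed loss that scales with
# installed capacity, plus a loss proportional to the load carried.
# The 4% and 3% figures are assumptions for this sketch only.

def ups_loss_kw(installed_kw: float, it_load_kw: float,
                fixed_loss_frac: float = 0.04,
                proportional_loss_frac: float = 0.03) -> float:
    """Estimated UPS losses in kW for a given installed capacity and load."""
    return installed_kw * fixed_loss_frac + it_load_kw * proportional_loss_frac

it_load = 100.0  # kW of IT load actually present today

# Provisioning the full design capacity up front vs. adding modules
# to stay roughly 18 months ahead of the actual load.
monolithic = ups_loss_kw(installed_kw=500.0, it_load_kw=it_load)
modular = ups_loss_kw(installed_kw=150.0, it_load_kw=it_load)

print(monolithic)  # 23.0 kW lost continuously
print(modular)     # 9.0 kW lost continuously
```

Under these assumed figures the over-provisioned system wastes more than twice the power of the leanly provisioned one at the same IT load, which is the effect both best practices target.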

7.2.2. Operational stage best practices

Listed below are the minimum expected best practices that should be implemented once the new CMR goes live.

Practice | Description | Type
Multiple tender for IT hardware – Power | Include the performance per Watt of the IT device as a high-priority decision factor in the tender process. This may be through the use of Energy Star or SPEC Power-type standard metrics, or through application- or deployment-specific user metrics more closely aligned to the target environment. The power consumption of the device at the expected utilisation or applied workload should be considered in addition to peak performance per Watt figures. | Operational
Multiple tender for IT hardware – Basic operating temperature and humidity range | Include the operating temperature and humidity ranges of new equipment as high-priority decision factors in the tender process. The minimum is the ASHRAE Recommended range for Class 1 Data Centers: 18–27°C, with a dew point of 5.5°C up to 15°C and 60% relative humidity. | Operational
Enable power management features | Formally change the deployment process to include the checking and enabling of power management features on hardware. | Operational
Provision to the as-configured power | Provision power and cooling only to the as-configured power draw capability of the equipment, not the PSU or nameplate rating. | Operational
Deploy using grid and virtualisation | Processes should be put in place to require senior business approval for any new service that requires dedicated hardware and will not run on a resource-sharing grid or virtualised platform. | Operational
Reduce IT hardware resilience level | Determine the business impact of service incidents for each deployed service and deploy only the level of hardware resilience actually justified. | Operational
Reduce hot / cold standby equipment | Determine the business impact of service incidents for each deployed service and deploy only the level of site BC / DR actually required. | Operational
Select efficient software | Make the performance of the software, in terms of the power draw of the hardware required to meet performance and availability targets, a primary selection factor. | Operational
Develop efficient software | Make the performance of the software, in terms of the power draw of the hardware required to meet performance and availability targets, a critical success factor. | Operational
Decommission unused services | Completely decommission and switch off, and preferably remove, the supporting hardware for unused services. | Operational
Data management policy | Develop a data management policy to define which data should be kept, for how long and at what level of protection. Communicate the policy to users and enforce it. Particular care should be taken to understand the impact of any data retention requirements. | Operational
Rack air flow management – Blanking plates | Install blanking plates where there is no equipment to reduce cold air passing through gaps in the rack. This also reduces air heated by one device being ingested by another device, increasing intake temperature and reducing efficiency. | Operational
Rack air flow management – Other openings | Install aperture brushes (draught excluders) or cover plates to close all air leakage opportunities in each rack. This includes:
  • Floor openings at the base of the rack
  • Gaps at the sides, top and bottom of the rack between equipment or mounting rails and the perimeter of the rack
| Operational
Raised floor air flow management | Close all unwanted apertures in the raised floor. Review the placement and opening factors of vented tiles. Maintain unbroken rows of cabinets to prevent bypass air; where necessary, fill gaps with empty, fully blanked racks. Managing unbroken rows is especially important in hot and cold aisle environments, as any opening between the aisles will degrade the separation of hot and cold air. | Operational
Provide adequate free area on rack doors | Solid doors often impede the cooling airflow and may promote recirculation within the enclosed cabinet, further increasing the equipment intake temperature. Where doors are necessary, they should be replaced with partially perforated doors to ensure adequate cooling airflow. | Operational
Review of cooling before IT equipment changes | The availability of cooling, including the placement and flow of vented tiles, should be reviewed before each IT equipment change to optimise the use of cooling resources. | Operational
Review of cooling strategy | Periodically review the IT equipment and cooling deployment against strategy. | Operational
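The ‘provision to the as-configured power’ practice can be sketched with a simple rack power budget. The server names, nameplate ratings and measured draws below are hypothetical figures for illustration.

```python
# Hypothetical rack contents: (name, PSU nameplate watts,
# measured watts at the expected utilisation).
servers = [
    ("web-01",  750, 310),
    ("web-02",  750, 305),
    ("db-01",  1100, 520),
]

# Budgeting to nameplate reserves capacity that will never be drawn.
nameplate_total = sum(psu for _, psu, _ in servers)

# Budgeting to as-configured draw, with an assumed 20% safety margin.
configured_total = sum(measured for _, _, measured in servers)
headroom = 1.2

print(nameplate_total)                     # 2600 W reserved by nameplate
print(round(configured_total * headroom))  # 1362 W actually needed
```

Even with a generous margin, the as-configured budget here is roughly half the nameplate figure, freeing power and cooling capacity for further racks.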

7.3. Service categorisation and grouping

The services delivered by OUCS should be reviewed to determine the required resilience and availability levels for each service. For user-facing services (such as email) this is dependent upon the level of direct user impact from service failures, whilst for infrastructure services (such as DNS) it is determined from the impact upon user-facing services (such as Internet access).

As OUCS will have multiple machine rooms and the ability to deliver multiple levels of resilience at the facility, hardware, network and software level it is sensible to define a series of standard offerings to meet the range of common availability criteria. The following table suggests a range of recovery and continuity levels and mechanisms as an example.

Level | Description | Implementation | Continuity or Recovery Mechanism
Low | No protection | Single physical or logical server, single feed, no UPS | Repair or restore from backup.
Low-medium | Low protection | Single physical or logical server, UPS protected | Repair or restore from backup.
Medium | Manual recovery | Single logical server, UPS protected, VM image on shared disk | Manually invoke virtual machine on alternate hardware in the same or an alternate machine room.
High | Auto recovery | Logical server with cold standby hardware in a separate machine room, replicated shared data | Automatically invoke virtual machine on designated hardware in an alternate machine room.
Very high | Continuity | Active / Active logical or physical servers in separate machine rooms | Automatic redirection of user traffic.

It should be noted that hardware-level disk replication is only necessary for legacy applications that offer no higher-level replication of data, and should not be viewed as a strategic solution.
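Once standard offerings are defined, each service can be assigned one and the assignments checked automatically. The sketch below is a minimal example; the service names and their classifications are invented, not an actual assessment of OUCS services.

```python
# Standard resilience offerings, matching the table above.
OFFERINGS = {"low", "low-medium", "medium", "high", "very-high"}

# Hypothetical service-to-offering assignments for illustration.
services = {
    "email": "very-high",  # direct, broad user impact on failure
    "dns":   "high",       # infrastructure service underpinning others
    "wiki":  "medium",
    "test-env": "low",
}

def nonstandard(catalogue: dict) -> list:
    """Return services assigned a level outside the standard offerings,
    which would need explicit justification and approval."""
    return [name for name, level in catalogue.items()
            if level not in OFFERINGS]

print(nonstandard(services))  # []
```

Forcing every service into one of the standard offerings keeps resilience (and its cost) a deliberate decision rather than a per-deployment accident.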

7.4. Service catalogue and CMDB

Many of the best practices identified above will be easier to achieve if a comprehensive Service Catalogue is implemented and maintained. Alongside the Service Catalogue, a Configuration Management Database should be established and rigorously maintained for the new CMR to ensure that existing IT equipment is both documented and controlled. All changes in location, connections, memory, storage etc. – i.e. “configuration” – should be recorded in detail in the CMDB.
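A minimal configuration item record might look like the sketch below. The schema is a hypothetical example following the attributes named above (location, connections, memory, storage), not the data model of any specific CMDB product.

```python
from dataclasses import dataclass, field

@dataclass
class ConfigurationItem:
    """One CMDB record for a piece of IT equipment in the CMR."""
    name: str
    location: str                                    # machine room / rack
    memory_gb: int
    storage_gb: int
    connections: list = field(default_factory=list)  # network / power links
    history: list = field(default_factory=list)      # audit trail of changes

    def update(self, attr: str, value):
        """Record every configuration change, as the text requires:
        old value and new value are kept in the audit trail."""
        self.history.append((attr, getattr(self, attr), value))
        setattr(self, attr, value)

ci = ConfigurationItem("web-01", "CMR rack A3", memory_gb=32, storage_gb=500)
ci.update("memory_gb", 64)  # a recorded upgrade, not a silent edit
print(ci.memory_gb)  # 64
print(ci.history)    # [('memory_gb', 32, 64)]
```

The point of the audit trail is that decommissioning reviews, capacity planning and the cooling reviews described in section 7.2.2 can all trust the recorded configuration rather than a physical survey.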
