7. Recommendations
The new CMR programme is intended to deliver a direct increase in both the physical
and power capacity available to OUCS. The new design offers a substantial
opportunity to build on this through data centre and IT efficiency
improvements, which would deliver more effective utilisation of the available
physical and energy capacity and therefore a further improvement in the
available computing services. This section summarises three sets of direct
recommendations for the facility, covering energy metering, implementation and
operational practices, and the management of services within the facilities.
7.1. Measuring operational energy efficiency
One of the benefits of designing and building a greenfield data centre is
the ability to include effective and useful instrumentation within the
design at little additional cost.
Before spending additional money on metering equipment it is important to
understand what benefits are expected from this instrumentation and the uses
to which the data can be put. Many operators have installed expensive
products that meter at the IT power strip or socket level in the data
centre. Whilst per-socket metering can provide very granular information,
this data is not directly useful to OUCS, which does not charge occupants
co-location fees that include a multiple of metered power. As indicated in
section 6.2.2, it is not reasonable or useful to allocate energy use or cost
to a device based only on the metered power at the PSU. Further, as
technology advances and more devices are virtualised, the notional
relationship between a power socket and a logical server or service
component is removed. Many IT devices also now report their energy use
through management APIs (this is required in the new Energy Star for Computer
Servers standard), which avoids the additional problems involved in
associating a physical IT device with a numbered socket.
Whilst many operators have installed instrumentation to measure power within
the data centre, this normally measures the energy used by the IT equipment
and not the energy we are concerned with here: the energy wasted by non-IT
equipment. If the latter is measured at all, it is frequently held in Building
Management System (BMS) software and is not visible to the IT department.
If the goal is to understand and manage IT energy use, energy efficiency and
cost, then a combination of IT and Mechanical and Electrical (M and E) device
consumption metering is required. PDU or rack-level metering of the IT load
is all that is necessary for OUCS when coupled with effective reporting of
the data centre infrastructure loads.
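As a minimal sketch of why both meter types are needed, the Python below splits total facility energy into the IT load and the non-IT overhead. The function name and the monthly figures are hypothetical and chosen only for illustration; the facility-to-IT ratio it reports is the quantity commonly referred to as PUE.

```python
def infrastructure_overhead(it_kwh: float, facility_kwh: float) -> dict:
    """Split total facility energy into IT load and non-IT overhead.

    it_kwh       -- energy delivered to IT equipment (sum of PDU or rack meters)
    facility_kwh -- total energy entering the data centre (incoming feed meter)
    """
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    if facility_kwh < it_kwh:
        raise ValueError("facility energy cannot be less than IT energy")
    overhead_kwh = facility_kwh - it_kwh
    return {
        "overhead_kwh": overhead_kwh,
        "overhead_per_it_kwh": overhead_kwh / it_kwh,
        # Total facility energy divided by IT energy (commonly called PUE).
        "facility_to_it_ratio": facility_kwh / it_kwh,
    }


# Hypothetical monthly meter totals, for illustration only.
result = infrastructure_overhead(it_kwh=50_000.0, facility_kwh=90_000.0)
print(result)
```

Neither figure can be derived from IT metering alone, which is why the infrastructure loads must be reported alongside the PDU or rack readings.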
Figure 5 illustrates three levels of data centre energy instrumentation. It
is recommended that, for the OUCS CMR, at least the detailed measurement
points (light blue) be instrumented. The meters should be network connected
and able to log their data to a central station. Many BMS products available
today are able to accept logging data from remote energy meters, and this data
should be made available from the BMS to other software. The historic
reporting data should be stored in as granular a form as possible: at least
hourly, and preferably at the 300-second IT polling rate.
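The relationship between the two storage granularities can be sketched as follows. This is an illustrative example only: it assumes each logged sample holds the average power in kW over the 300 seconds following its timestamp, and the sample values and dates are invented.

```python
from collections import defaultdict
from datetime import datetime, timedelta

POLL_SECONDS = 300  # the IT polling interval named in the text


def hourly_energy_kwh(samples):
    """Convert (timestamp, average_kw) samples into kWh per clock hour.

    Assumption: each sample is the average power over the POLL_SECONDS
    that follow its timestamp.
    """
    totals = defaultdict(float)
    for timestamp, kw in samples:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        totals[hour] += kw * POLL_SECONDS / 3600.0
    return dict(totals)


# Two hours of hypothetical readings at a constant 12 kW.
start = datetime(2010, 1, 1, 0, 0)
samples = [(start + timedelta(seconds=POLL_SECONDS * i), 12.0) for i in range(24)]
hourly = hourly_energy_kwh(samples)
for hour, kwh in sorted(hourly.items()):
    print(hour.isoformat(), round(kwh, 2))
```

Hourly totals can always be rebuilt from the 300-second data in this way, but not the reverse, which is the argument for retaining the finer-grained record where storage allows.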
Data captured from the detailed measurement points will allow future
energy-impacting changes to be simulated, giving a much greater
understanding of, and control over, energy use within the CMR.
7.2. Applying the EU Code of Conduct for Data Centres
It is recommended that Oxford University implement the EU Code of Conduct for
Data Centres. The Code details best practices that apply to M and E
infrastructure, IT systems and software. The best practices have been
developed and reviewed by industry experts and represent a practical and
effective approach to energy efficient design and on-going energy management
within the data centre.
Romonet recommends that Oxford University implement, at a minimum, the
following best practices defined in the EU Code of Conduct for Data Centres.
Ideally Oxford University would become a full Participant of the Code and
report its energy data back to the EU under the requirements of Participant
status. The mandatory requirement for reporting of high-level power data
will ensure regular (six-monthly) senior visibility of the energy data from
the CMR and an increased understanding and appreciation of how the
efficiency of the facility is being managed. Regular reviews of the
reporting will also give a much better appreciation of the rate of growth in
IT-service-related energy use.
7.2.1. Design stage best practices
Listed below are the minimum best practices expected to be implemented
during a new-build or major data centre re-fit.
| Practice | Description | Implementation stage |
|---|---|---|
| Group involvement | Establish an approval board containing representatives from all disciplines (software, IT, M and E). Require approval for any significant decision to ensure that the impacts of the decision have been properly understood and an optimal solution reached. For example, this would include the definition of standard IT hardware lists. | Design and Operational |
| Build resilience to business requirements | Only the level of resilience actually justified by business requirements and impact analysis should be built. 2N infrastructures are frequently unnecessary and inappropriate. Resilience for a small portion of critical services can be obtained using DR / BC sites. | Design |
| Consider multiple levels of resilience | It is possible to build a single data centre to provide multiple levels of power and cooling resilience to different floor areas. Many co-location providers already deliver this, for example, optional ‘grey’ power feeds without UPS or generator back up. | Design |
| Design – Contained hot or cold air | There are a number of design concepts whose basic intent is to contain and separate the cold air from the heated return air on the data floor: hot aisle containment; cold aisle containment; contained rack supply, room return; room supply, contained rack return; contained rack supply, contained rack return. This action is expected for air-cooled facilities over 1 kW per square metre power density. | Design and Operational |
| Efficient part load operation | Optimise the facility for the partial load it will experience for most of its operational time rather than for maximum load, e.g. sequence chillers, operate cooling towers with shared load for increased heat exchange area. | Design |
| Variable Speed Fans | Many old CRAC units operate fixed speed fans which consume substantial power and obstruct attempts to manage the data floor temperature. Variable speed fans are particularly effective where there is a high level of redundancy in the cooling system, low utilisation of the facility or highly variable IT electrical load. | Design |
| Modular UPS Deployment | It is now possible to purchase modular UPS systems across a broad range of power delivery. Physical installation, transformers and cabling are prepared to meet the design electrical load of the facility, but the sources of inefficiency (switching units and batteries) are installed, as required, in modular units. This substantially reduces both the capital cost and the fixed overhead losses of these systems. In low power environments these may be frames with plug-in modules; in larger environments these are likely to be entire UPS units. | Design |
| Lean provisioning of power and cooling for a maximum of 18 months of data floor capacity | The provisioning of excess power and cooling capacity in the data centre drives substantial fixed losses and is unnecessary. Planning a data centre for modular expansion and then building out this capacity in a rolling programme of deployments is more efficient. This also allows the technology ‘generation’ of the IT equipment and supporting M and E infrastructure to be matched, improving both efficiency and the ability to respond to business requirements. | Design |
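The effect of modular UPS deployment and lean provisioning on fixed losses can be illustrated with a simple loss model. The sketch below is not taken from the Code of Conduct: the two loss fractions, the 400 kW design load, the 150 kW current load and the 100 kW module size are all hypothetical, and redundancy (for example N+1 modules) is deliberately ignored.

```python
import math

FIXED_LOSS_FRACTION = 0.03         # hypothetical: loss per kW of installed UPS capacity
PROPORTIONAL_LOSS_FRACTION = 0.04  # hypothetical: loss per kW of IT load carried


def ups_loss_kw(load_kw, installed_kw):
    """Estimate UPS losses as a fixed part plus a load-dependent part."""
    if load_kw > installed_kw:
        raise ValueError("load exceeds installed UPS capacity")
    return installed_kw * FIXED_LOSS_FRACTION + load_kw * PROPORTIONAL_LOSS_FRACTION


def modular_capacity_kw(load_kw, module_kw):
    """Capacity installed when only enough modules for the current load are fitted."""
    modules = max(1, math.ceil(load_kw / module_kw))
    return modules * module_kw


design_load_kw = 400.0   # hypothetical design load of the facility
current_load_kw = 150.0  # hypothetical load in the early life of the facility

# Monolithic: the full design capacity is installed from day one.
monolithic_loss = ups_loss_kw(current_load_kw, installed_kw=design_load_kw)
# Modular: 100 kW modules are added only as the load requires them.
modular_loss = ups_loss_kw(
    current_load_kw,
    installed_kw=modular_capacity_kw(current_load_kw, module_kw=100.0),
)
print(f"monolithic: {monolithic_loss:.1f} kW, modular: {modular_loss:.1f} kW")
```

Under these assumed figures the load-dependent loss is identical in both cases; the saving comes entirely from the fixed overhead of capacity that has not yet been installed.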
7.2.2. Operational stage best practices
Listed below are the minimum expected best practices that should be
implemented once the new CMR goes live.
| Practice | Description | Implementation stage |
|---|---|---|
| Multiple tender for IT hardware – Power | Include the performance per Watt of the IT device as a high priority decision factor in the tender process. This may be through the use of Energy Star or SPEC Power type standard metrics or through application or deployment specific user metrics more closely aligned to the target environment. The power consumption of the device at the expected utilisation or applied workload should be considered in addition to peak performance per Watt figures. | Operational |
| Multiple tender for IT hardware – Basic operating temperature and humidity range | Include the operating temperature and humidity ranges of new equipment as high priority decision factors in the tender process. The minimum is the ASHRAE Recommended range for Class 1 Data Centers: 18-27°C, and 5.5°C dew point up to 15°C dew point and 60% RH. | Operational |
| Enable power management features | Formally change the deployment process to include the checking and enabling of power management features on hardware. | Operational |
| Provision to the as-configured power | Provision power and cooling only to the as-configured power draw capability of the equipment, not the PSU or nameplate rating. | Operational |
| Deploy using Grid and Virtualisation | Processes should be put in place to require senior business approval for any new service that requires dedicated hardware and will not run on a resource sharing grid or virtualised platform. | Operational |
| Reduce IT hardware resilience level | Determine the business impact of service incidents for each deployed service and deploy only the level of hardware resilience actually justified. | Operational |
| Reduce Hot / Cold standby equipment | Determine the business impact of service incidents for each deployed service and deploy only the level of site BC / DR actually required. | Operational |
| Select efficient software | Make the performance of the software, in terms of the power draw of the hardware required to meet performance and availability targets, a primary selection factor. | Operational |
| Develop efficient software | Make the performance of the software, in terms of the power draw of the hardware required to meet performance and availability targets, a critical success factor. | Operational |
| Decommission unused services | Completely decommission and switch off, and preferably remove, the supporting hardware for unused services. | Operational |
| Data Management Policy | Develop a data management policy to define which data should be kept, for how long and at what level of protection. Communicate the policy to users and enforce it. Particular care should be taken to understand the impact of any data retention requirements. | Operational |
| Rack air flow management – Blanking Plates | Install blanking plates where there is no equipment, to reduce cold air passing through gaps in the rack. This also reduces the ingestion by one device of air heated by another, which would otherwise increase intake temperature and reduce efficiency. | Operational |
| Rack air flow management – Other Openings | Install aperture brushes (draught excluders) or cover plates to cover all air leakage opportunities in each rack. This includes floor openings at the base of the rack, and gaps at the sides, top and bottom of the rack between equipment or mounting rails and the perimeter of the rack. | Operational |
| Raised floor air flow management | Close all unwanted apertures in the raised floor. Review placement and opening factors of vented tiles. Maintain unbroken rows of cabinets to prevent bypass air; where necessary, fill gaps with empty, fully blanked racks. Managing unbroken rows is especially important in hot and cold aisle environments. Any opening between the aisles will degrade the separation of hot and cold air. | Operational |
| Provide adequate free area on rack doors | Solid doors often impede the cooling airflow and may promote recirculation within the enclosed cabinet, further increasing the equipment intake temperature. Where doors are necessary, solid doors can be replaced with partially perforated doors to ensure adequate cooling airflow. | Operational |
| Review of cooling before IT equipment changes | The availability of cooling, including the placement and flow of vented tiles, should be reviewed before each IT equipment change to optimise the use of cooling resources. | Operational |
| Review of cooling strategy | Periodically review the IT equipment and cooling deployment against strategy. | Operational |
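The tender practice of weighing performance per Watt at the expected utilisation, and not only at peak, can be sketched as below. The two candidate devices and all of their figures are invented, and the linear interpolation between idle and peak power is a simplifying assumption, not a property of any real hardware or of the Energy Star or SPEC Power metrics.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical tender response; all figures are illustrative."""
    name: str
    peak_ops: float    # benchmark throughput at 100% load
    peak_watts: float  # power draw at 100% load
    idle_watts: float  # power draw when idle

    def watts_at(self, utilisation: float) -> float:
        # Assumption: power rises linearly from idle to peak with utilisation.
        return self.idle_watts + (self.peak_watts - self.idle_watts) * utilisation

    def ops_per_watt_at(self, utilisation: float) -> float:
        return self.peak_ops * utilisation / self.watts_at(utilisation)


candidates = [
    Candidate("A", peak_ops=1000.0, peak_watts=400.0, idle_watts=300.0),
    Candidate("B", peak_ops=900.0, peak_watts=400.0, idle_watts=100.0),
]

EXPECTED_UTILISATION = 0.2  # hypothetical typical load for the target service

best_at_peak = max(candidates, key=lambda c: c.ops_per_watt_at(1.0))
best_at_expected = max(candidates, key=lambda c: c.ops_per_watt_at(EXPECTED_UTILISATION))
print("best at peak:", best_at_peak.name)
print("best at expected utilisation:", best_at_expected.name)
```

In this invented comparison the device with the better peak figure is the worse choice at the expected workload because of its higher idle draw, which is the situation the practice is intended to catch.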
7.3. Service categorisation and grouping
The services delivered by OUCS should be reviewed to determine the required
resilience and availability levels for each service. For user-facing
services (such as email) this depends upon the level of direct user
impact from service failures, whilst for infrastructure services (such as
DNS) it is determined by the impact upon user-facing services (such as
Internet access).
As OUCS will have multiple machine rooms and the ability to deliver multiple
levels of resilience at the facility, hardware, network and software levels,
it is sensible to define a series of standard offerings to meet the range of
common availability criteria. The following table suggests a range of
recovery and continuity levels and mechanisms as an example.
| Level | Description | Implementation | Continuity or Recovery Mechanism |
|---|---|---|---|
| Low | No protection. | Single physical or logical server, single feed, no UPS | Repair or restore from backup. |
| Low-medium | Low protection. | Single physical or logical server, UPS protected | Repair or restore from backup. |
| Medium | Manual recovery. | Single logical server, UPS protected, VM image on shared disk | Manually invoke virtual machine on alternate hardware in same or alternate machine room. |
| High | Auto recovery. | Logical server with cold standby hardware in separate machine room, replicated shared data | Automatically invoke virtual machine on designated hardware in alternate machine room. |
| Very high | Continuity. | Active / Active logical or physical servers in separate machine rooms | Automatic redirection of user traffic. |
It should be noted that hardware-level disk replication is only necessary for
legacy applications that offer no higher-level replication of data and
should not be viewed as a strategic solution.
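One possible way to hold the standard offerings from the table in this section in machine-readable form is sketched below. The level names and mechanisms are taken from the table; the assignment of individual services to levels is entirely hypothetical and would in practice be the outcome of the service review recommended above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    LOW_MEDIUM = 2
    MEDIUM = 3
    HIGH = 4
    VERY_HIGH = 5


@dataclass(frozen=True)
class Offering:
    description: str
    mechanism: str


# Descriptions and mechanisms as listed in the example table.
CATALOGUE = {
    Level.LOW: Offering("No protection", "Repair or restore from backup"),
    Level.LOW_MEDIUM: Offering("Low protection", "Repair or restore from backup"),
    Level.MEDIUM: Offering(
        "Manual recovery",
        "Manually invoke virtual machine on alternate hardware",
    ),
    Level.HIGH: Offering(
        "Auto recovery",
        "Automatically invoke virtual machine on designated hardware",
    ),
    Level.VERY_HIGH: Offering("Continuity", "Automatic redirection of user traffic"),
}

# Hypothetical assignments, for illustration only.
SERVICE_LEVELS = {"email": Level.HIGH, "dns": Level.VERY_HIGH, "test-wiki": Level.LOW}


def recovery_mechanism(service: str) -> str:
    """Look up the recovery or continuity mechanism for a named service."""
    return CATALOGUE[SERVICE_LEVELS[service]].mechanism


def services_at_or_above(level: Level) -> list:
    """List services whose assigned level is at least the given level."""
    return sorted(name for name, assigned in SERVICE_LEVELS.items() if assigned >= level)


print(recovery_mechanism("email"))
print(services_at_or_above(Level.HIGH))
```

Holding the levels as an ordered type makes it straightforward to report, for example, which services require equipment in more than one machine room.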
7.4. Service catalogue and CMDB
Many of the best practices identified above will be easier to achieve if a
comprehensive Service Catalogue is implemented and maintained. Alongside the
Service Catalogue, a Configuration Management Database should be established
and rigorously maintained for the new CMR to ensure that existing IT
equipment is both documented and controlled. All changes in location,
connections, memory, storage etc. – i.e. “configuration” – should be
recorded in detail in the CMDB.
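The principle that every configuration change is recorded can be sketched as a simple record type. This is an illustration only and does not represent any particular CMDB product; the class, the attribute names, the asset identifier and the user name are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ConfigurationItem:
    """One piece of IT equipment tracked in a (hypothetical) CMDB."""
    asset_id: str
    configuration: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def record_change(self, attribute, new_value, changed_by):
        """Apply a configuration change and keep an audit entry for it."""
        entry = {
            "when": datetime.now(timezone.utc),
            "who": changed_by,
            "attribute": attribute,
            "old": self.configuration.get(attribute),
            "new": new_value,
        }
        self.configuration[attribute] = new_value
        self.history.append(entry)
        return entry


# Hypothetical server and changes, for illustration only.
server = ConfigurationItem("srv-0001", {"location": "room A, rack 12", "memory_gb": 16})
server.record_change("memory_gb", 32, changed_by="jsmith")
server.record_change("location", "room B, rack 3", changed_by="jsmith")
for entry in server.history:
    print(entry["attribute"], entry["old"], "->", entry["new"])
```

Routing every change through a single method is what keeps the current configuration and its history consistent with each other.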