Our last two blogs on DCIM Policies discussed “Risk Management” and “Governance.” Risk Management covered Alarm, Escalation, Redundancy and Disaster Recovery Policies. Governance covered Security, Data Retention, Approval and SLA Policies.
This last part covers “Efficiency Management”: a set of critical KPIs that form the core of a Data Center Manager’s Handbook. The Green Grid, ASHRAE and the Uptime Institute have defined a number of KPIs for an energy-efficient and operationally efficient data center. Typically, these KPIs appear on the DCIM dashboard. Four policies are covered in this section: PUE Policy, Rack Load Policy, Replacement Policy and Preventive Maintenance Policy.
1. PUE Policy: The Power Usage Effectiveness (PUE) metric is an industry standard for reporting the energy performance of data centers. Organizations need to take several measures to ensure a better PUE. PUE policies in DCIM would be as follows:
a) PUE range values: A data center may define a maximum acceptable average annualized PUE depending on external temperature conditions. Alerts would be sent accordingly. Newer data centers (or those where DCIM has been recently implemented) that do not yet have a full year of PUE values can maintain a daily, weekly, monthly or quarterly average instead.
b) UPS load: Matching UPS capacity to the system load improves PUE. If a UPS is loaded to only 30% of capacity, its efficiency will be much lower. Hence, we may define a lower threshold of UPS load that should generate an alert. An upper load threshold must also be defined to keep the power load of the connected downstream devices balanced.
c) Carbon Usage Effectiveness (CUE): The Green Grid, the authors of PUE, have also defined another metric, CUE, which is derived from PUE. Sustainability-conscious organizations maintain CUE as another metric and may ask for it to be included in alert generation as well.
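To make the PUE/CUE policy concrete, here is a minimal sketch of how such a threshold check might be implemented. The policy ceiling of 1.6 and the carbon emission factor are illustrative assumptions, not standard values; real figures depend on the facility and its energy source.

```python
# Illustrative emission factor: kg CO2 per kWh of facility energy (grid-dependent assumption).
CARBON_EMISSION_FACTOR = 0.5

def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """PUE = total facility energy / IT equipment energy (The Green Grid)."""
    return total_facility_kwh / it_equipment_kwh

def cue(total_facility_kwh: float, it_equipment_kwh: float,
        emission_factor: float = CARBON_EMISSION_FACTOR) -> float:
    """CUE = emission factor x PUE, i.e. kg CO2 emitted per kWh of IT energy."""
    return emission_factor * pue(total_facility_kwh, it_equipment_kwh)

def check_pue_policy(total_kwh: float, it_kwh: float,
                     max_pue: float = 1.6) -> list[str]:
    """Return alert messages when the measured PUE breaches the policy ceiling."""
    measured = pue(total_kwh, it_kwh)
    if measured > max_pue:
        return [f"PUE {measured:.2f} exceeds policy maximum {max_pue:.2f}"]
    return []
```

In a DCIM deployment the inputs would come from facility and IT power meters, and the returned messages would feed the alarm and escalation policies described earlier.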
2. Rack Load Policy: A data center must have a proper rack load policy in place in terms of power load, temperature, weight, U-space and ownership allocation. Threshold or procedure breaches in rack loads need to generate on-screen warnings or alerts.
a) Rack Power: Racks are allocated power loads, say 8 kW. If a rack is already loaded with devices drawing 7.5 kW, then a workflow approval request proposing that rack as a location for a 900 W server should first be rejected. If the operator still attempts to configure the server in DCIM, an on-screen warning would be displayed. If the operator places the server anyway and the rack load jumps beyond 8 kW, a critical alert would immediately be sent as per the escalation policy.
b) Rack Temperature: Rack temperatures are defined under alarm settings. If temperatures exceed thresholds, alerts would be sent.
c) Rack Weight: Depending on floor load bearing capacity, a certain weight capacity is allocated for each Rack. Alerts can be configured accordingly.
d) Rack U-space: Typically some U-spaces in the rack are kept free, and these should be defined. If not an alert, at least an on-screen warning should appear when an operator commits this procedure breach.
e) Rack Ownership: Racks, or even individual U-spaces, may be allocated to a business owner. Placing a device belonging to a different owner there should generate a warning or alert.
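The rack-load checks above can be sketched as a single validation pass run before a device placement is committed. The field names, capacities and reserved-U convention below are illustrative assumptions, not a real DCIM schema.

```python
from dataclasses import dataclass

@dataclass
class Rack:
    power_capacity_w: float
    weight_capacity_kg: float
    total_u: int
    reserved_u: int          # U-spaces the policy keeps free
    owner: str
    used_power_w: float = 0.0
    used_weight_kg: float = 0.0
    used_u: int = 0

@dataclass
class Device:
    power_w: float
    weight_kg: float
    size_u: int
    owner: str

def placement_violations(rack: Rack, device: Device) -> list[str]:
    """Return every policy breach that placing `device` in `rack` would cause."""
    issues = []
    if rack.used_power_w + device.power_w > rack.power_capacity_w:
        issues.append("power: rack load would exceed allocated capacity")
    if rack.used_weight_kg + device.weight_kg > rack.weight_capacity_kg:
        issues.append("weight: floor load-bearing limit would be exceeded")
    if rack.used_u + device.size_u > rack.total_u - rack.reserved_u:
        issues.append("u-space: placement would use reserved free U-spaces")
    if device.owner != rack.owner:
        issues.append("ownership: device belongs to a different business owner")
    return issues
```

An empty result means the placement is clean; a non-empty list would drive the workflow rejection, on-screen warning or escalation alert, depending on how far the operator has proceeded.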
3. Replacement Policy: In this policy, we define the expected service life for each category of device in the Data Center.
a) Alerts can be configured when a device is nearing end of life. This helps in decommissioning planning.
b) Alerts could also be set up before the actual replacement so that affected users can make contingency plans should something go wrong during the transition.
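A simple sketch of how such a replacement-policy check could work is shown below. The service-life figures and the 90-day warning lead time are illustrative assumptions; each organization would define its own values per device category.

```python
from datetime import date, timedelta

# Defined service life (years) per device category -- illustrative values only.
SERVICE_LIFE_YEARS = {"server": 5, "ups_battery": 4, "pdu": 10}

def end_of_life(install_date: date, category: str) -> date:
    """Approximate end-of-life date from the defined service life."""
    return install_date + timedelta(days=365 * SERVICE_LIFE_YEARS[category])

def replacement_alerts(install_date: date, category: str, today: date,
                       warn_days: int = 90) -> list[str]:
    """Warn ahead of end of life so decommissioning can be planned."""
    eol = end_of_life(install_date, category)
    if today >= eol:
        return [f"{category}: past end of life ({eol.isoformat()})"]
    if (eol - today).days <= warn_days:
        return [f"{category}: end of life on {eol.isoformat()} -- plan replacement"]
    return []
```

Run daily against the asset inventory, this produces the early warnings that drive decommissioning and contingency planning.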
4. Preventive Maintenance Schedule Policy: As a common practice, most changes in a data center are planned during non-critical periods. Preventive maintenance and upgrade schedules with expected downtimes can be defined in DCIM. The following can then be configured:
a) Switching off non-reachability alerts during this downtime
b) If the actual downtime exceeds the expected downtime by a certain margin, an alert would be sent
c) Validating from Power and Network Chains that scheduled preventive maintenance of a device does not have a cascading impact. If it does, an alert would be generated.
Each operating procedure in the data center should have a policy behind it to help keep the environment maintained and managed. Deviations from the acceptable range should be automatically detected for corrective action, and where possible a violation should be prevented outright. Besides helping to avoid data center failures, automated policies enable better governance and drive efficiency improvements. With the increased adoption of DCIM as operations, planning and management software for data centers, Standard Operating Procedures (i.e., Policies) must form the core of an effective DCIM.
To learn more about DCIM Policies, please read the whitepaper…
The “DCIM Policies: Automating Data Center Standard Operating Procedures” whitepaper outlines the importance of automating data center standard operating procedures, and how these policies help to avoid data center failures, help in better governance and driving efficiency improvements. Download Now.
As Cloud Computing becomes mainstream, we would see a changing role of Data Center Infrastructure Management software. While today’s DCIM Software is like an ERP for Data Centers, the Next Gen DCIM Software will be like Supply Chain Management to manage data centers in the Cloud. Whether you are a Cloud provider or a business that has its IT infrastructure with one, you should be able to better allocate your data center assets with the help of DCIM software. This means you can have fewer redundancies without compromising on uptime. Better still: you may even be able to improve your uptime through improved business continuity management provided by DCIM Software!
Why the Cloud?
Traditional data centers are designed to handle peaks in demand. On an average annualized basis, less than 10% of server capacity is used during non-peak times, resulting in inefficient utilization of costly resources. Since businesses cannot afford to do away with over-provisioning due to high availability requirements, the capital costs of a data center are very high.
Shifting some, or all, of their infrastructure to the Cloud gives businesses the ability to handle sudden, unanticipated and extraordinary loads. Cloud Computing provides the extra capacity to handle these peaks through automatic provisioning from assets that are under-utilized at that moment. This is known as elasticity, and it provides the resources to handle emergencies. How does it help? It controls capital costs and reduces operating expenses.
While scalability lets you plan in advance and provision your IT resources accordingly, elasticity lets you come out a winner in emergencies. Your data center doesn’t fail to deliver, your data is intact and your reputation is just as good as ever.
How is this done?
First, I am going to define the term orchestration in the context of cloud computing. Orchestration refers to combining multiple distinct automated tasks into a single workflow, and it provides centralized management across systems and networks, including multiple devices, applications, solutions and entire data centers. It even takes care of the financial aspects of managing your IT infrastructure, including billing, metering and power consumption.
The Data-centric Management Framework (DMF) approach to cloud orchestration (proposed by AT&T Labs and U. Penn) aims to maintain a conceptually centralized data repository of all the resources being managed, including computational, storage and network devices.
The Next Gen DCIM would give businesses a unified and enhanced management interface across multiple data centers whether on-site or on the cloud.
How would the Next Gen DCIM do all this?
- It would assist with load balancing and on-demand provisioning of both physical and virtual resources and provide for broad platform compatibility across your entire IT infrastructure.
- It would act as a cloud agent or cloud brokering software allowing businesses to switch or augment provisioning between cloud providers effortlessly.
The Next Gen DCIM will be part of the management stack of Cloud Computing, and this would help to dramatically reduce costs and mitigate risks of Data Center failures.