DCIM Policies: Automating Data Center Standard Operating Procedures (Part 1)

Written by  Thursday, 27 July 2017 11:03

Recent high-profile data center outages have again brought to the fore that, while substantial investments have been made in equipment and facilities for redundancy and disaster recovery, there is still a heavy reliance on manual operations. Surveys have indicated that human error ranks as the second-highest cause of data center outages. This in turn has been attributed to a failure to adhere to standard operating procedures (SOPs), which are usually well defined but forgotten - or worse, never communicated to operating staff.

There are several pitfalls in keeping SOPs as a manual rather than automating the procedures. The logical home for automated procedures is the DCIM (Data Center Infrastructure Management) system, which is essentially operations, planning and management software for a data center. This set of operational procedures is packaged into a “DCIM Policies” framework that links into the different modules of the DCIM software, so that the DCIM detects any potential violation and sends alerts.

There are 12 key operating procedures that should be part of “DCIM Policies”. These policies fall broadly under three major categories: Risk Management, Governance and Efficiency Management. I have written this blog in three parts. This is Part 1, in which I discuss the first category, Risk Management, and the DCIM policies that fall under it.

I. Risk Management: This aims to mitigate a Data Center Manager’s nightmare: unplanned downtime or, worse, an extended outage that disrupts business application availability, causes massive financial loss and damages an organization’s reputation. The policies that fall under Risk Management are Alarm Policy, Escalation Policy, Redundancy Policy and Disaster Recovery Policy.

1. Alarm Policy: This helps decide which devices and parameters need to be monitored at what frequency, and defines their threshold levels in the system. Consider the expected operating temperature and humidity ranges as an example. Ideally, we should define these ranges at the device level, at the rack level, at the row level (for each hot and cold aisle) and at the room level (for the general comfort of operating staff). This is a high-priority decision factor under the DCIM alarm policy, to prevent smoke, fire or damage to devices.

2. Escalation Policy: It is important to establish a clear-cut escalation process that specifies how and when alerts should be escalated in a data center. The escalation policy needs to be developed and rehearsed to ensure that the chain of command is informed and the appropriate resources are brought to bear as a situation develops. An escalation table can be defined in the DCIM, outlining the protocol, the channels for escalating issues and the contact personnel with the appropriate expertise.

3. Redundancy Policy: This should be defined in the DCIM according to the customer’s needs, i.e. whether to have an N+1 or N+2 configuration. It is not just the redundant components that matter, but also the process to test them and make sure they work reliably, such as scheduled failover drills and research into new methodologies. If we cannot have two of a component, we need to figure out how to cobble together a replacement system if the primary equipment fails or becomes unavailable.

4. Disaster Recovery Policy: It is crucial to have a disaster recovery plan in place, with RTO (Recovery Time Objective) and RPO (Recovery Point Objective) metrics well defined in the SLA. A data center disaster is a situation in which none of the redundancy options are available: a complete power outage, for example, is a disaster. In such a situation, how quickly can we get at least one section of the data center up and running again (RTO)? And how far back in time is the last point to which data can be restored, in other words how much recent data can we afford to lose (RPO)?
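
As a rough illustration of how the four Risk Management policies above might be expressed as rules a system can check automatically, here is a minimal Python sketch. It is not taken from GFS Crane or any other DCIM product; every name in it (Threshold, ESCALATION_TABLE, contacts_to_notify, meets_redundancy, meets_recovery_objectives) is hypothetical, and all ranges, delays and roles are placeholder values rather than recommendations.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Threshold:
    """Alarm Policy: acceptable range for one monitored parameter (hypothetical model)."""
    parameter: str  # e.g. "temperature_c" or "humidity_pct"
    scope: str      # "device", "rack", "row" or "room"
    low: float
    high: float

    def violated_by(self, reading: float) -> bool:
        """True when the reading falls outside the acceptable range."""
        return not (self.low <= reading <= self.high)


# Escalation Policy: who is contacted once an alert has stayed open this long.
# Delays and roles are placeholders, not recommendations.
ESCALATION_TABLE = [
    (timedelta(minutes=0), "on-duty operator"),
    (timedelta(minutes=15), "shift supervisor"),
    (timedelta(minutes=45), "data center manager"),
]


def contacts_to_notify(open_for: timedelta) -> list:
    """Return every role whose escalation delay has already elapsed."""
    return [role for delay, role in ESCALATION_TABLE if open_for >= delay]


def meets_redundancy(units_required: int, units_healthy: int, spare: int = 1) -> bool:
    """Redundancy Policy: True if N+spare holds, where N is units_required."""
    return units_healthy >= units_required + spare


def meets_recovery_objectives(downtime: timedelta, data_loss_window: timedelta,
                              rto: timedelta, rpo: timedelta) -> bool:
    """Disaster Recovery Policy: compare a drill or incident against the SLA targets."""
    return downtime <= rto and data_loss_window <= rpo


# Example usage with made-up values.
cold_aisle = Threshold("temperature_c", "row", low=18.0, high=27.0)
print(cold_aisle.violated_by(29.5))                          # True
print(contacts_to_notify(timedelta(minutes=20)))             # ['on-duty operator', 'shift supervisor']
print(meets_redundancy(units_required=4, units_healthy=4))   # False: only N, not N+1
```

Keeping each policy as plain data plus a small check function is one possible design choice; it lets the same rules be reviewed by operations staff and evaluated by software, which is the point of moving SOPs out of a manual.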

Every process or operating procedure within the data center should have a policy behind it to help keep the environment maintained and managed. Deviations from the acceptable range should be detected automatically for immediate corrective action and, where possible, a violation should be prevented altogether. Besides helping to avoid data center failures, automated policies help improve governance and drive efficiency improvements. In my next blog I will discuss streamlined governance and best practices that apply to data centers. For more information on how to derive benefits from DCIM policy-based systems, download the Greenfield Software white paper “DCIM Policies: Automating Data Center Standard Operating Procedures”.

Shekhar Dasgupta

Shekhar Dasgupta is the Founder & CEO of GreenField Software Private Limited, a venture pioneering next-generation data center technologies for cloud infrastructures.
