Shekhar Dasgupta is the Founder & CEO of GreenField Software Private Limited, a venture pioneering next-generation data center technologies for cloud infrastructures.
Our last two blogs on DCIM Policies discussed “Risk Management” and “Governance.” Risk Management covered Alarm, Escalation, Redundancy and Disaster Recovery Policies. Governance covered Security, Data Retention, Approval and SLA Policies.
This last part will cover “Efficiency Management”: a set of critical KPIs that form the core of a Data Center Manager’s Handbook. The Green Grid, ASHRAE and Uptime Institute have defined a number of KPIs for an energy-efficient and operationally efficient data center. Typically, these KPIs appear on the DCIM dashboard. Four policies are covered in this section: PUE Policy, Rack Load Policy, Replacement Policy and Preventive Maintenance Policy.
1. PUE Policy: The Power Usage Effectiveness (PUE) metric is an industry standard for reporting the energy performance of data centers. Organizations need to take several measures to achieve a better PUE. PUE policies in DCIM would be as follows:
a) PUE range values: A data center may define maximum acceptable average annualized PUE depending on external temperature conditions. Alerts would be sent accordingly. Newer data centers (or where DCIM has been recently implemented) which do not have a year’s PUE values maintain a daily/weekly/monthly/quarterly average.
b) UPS load: Matching UPS load to the system load improves PUE. If the UPS is loaded to only 30% of capacity, its efficiency will be much lower. Hence, we may define a lower threshold of UPS load below which an alert is generated. An upper threshold must also be defined to keep the power load of the connected downstream devices balanced.
c) Carbon Usage Effectiveness (CUE): The Green Grid, the author of PUE, has also defined another metric, CUE, which depends on PUE. Sustainability-conscious organizations maintain CUE as another metric and may ask for it to be included for generating alerts as well.
2. Rack Load Policy: A data center must have a proper rack load policy in place in terms of power load, temperature, weight, U-space and ownership allocation. Threshold or procedure breaches in rack loads need to generate on-screen warnings or alerts.
a) Rack Power: Racks are allocated power loads, say 8 kW. If a rack is already loaded with devices drawing up to 7.5 kW, a workflow approval request that proposes this rack for a 900 W server should first be rejected. If the operator still attempts to configure the server in the DCIM, an on-screen warning is displayed. If the operator places the server anyway and the rack load rises above 8 kW, a critical alert is sent immediately as per the escalation policy.
b) Rack Temperature: Rack temperatures are defined under alarm settings. If temperatures exceed thresholds, alerts would be sent.
c) Rack Weight: Depending on floor load bearing capacity, a certain weight capacity is allocated for each Rack. Alerts can be configured accordingly.
d) Rack U-space: Typically some U-spaces in the rack are kept free, which should be defined. If not an alert, at least an on-screen warning should appear when an operator is committing this procedure breach.
e) Rack Ownership: Racks or even U-spaces may be allocated to a business owner. Placing a device of a different owner on this should generate a warning or alert.
3. Replacement Policy: In this policy, we define life for each category of device in the Data Center.
a) Alerts can be configured when a device is nearing end of life. This helps in decommission planning.
b) Alerts could also be set up ahead of the actual replacement so that affected users can make contingency plans should something go wrong during the transition.
4. Preventive Maintenance Schedule Policy: As a common practice, most changes in a data center are planned during non-critical periods. Preventive maintenance and upgrade schedules with expected downtimes can be defined in DCIM. The following can then be configured:
a) Switching off the non-reachability alert during this downtime
b) Sending an alert if actual downtime exceeds expected downtime by a certain margin
c) Validating from Power and Network Chains that scheduled preventive maintenance of a device does not have a cascading impact. If it does, an alert would be generated.
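The first two maintenance rules above can be pictured with a minimal Python sketch. The names `MaintenanceWindow`, `handle_unreachable` and `overrun_margin` are illustrative assumptions, not part of any particular DCIM product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MaintenanceWindow:
    """Hypothetical record of one scheduled preventive maintenance."""
    device_id: str
    start: datetime
    expected_downtime: timedelta
    overrun_margin: timedelta  # tolerated excess before an alert is raised

    @property
    def expected_end(self) -> datetime:
        return self.start + self.expected_downtime


def handle_unreachable(window: MaintenanceWindow, device_id: str, now: datetime) -> str:
    """Decide what to do when a device is found unreachable at time `now`."""
    if device_id != window.device_id or now < window.start:
        return "alert"          # not covered by the window: normal alerting applies
    if now <= window.expected_end + window.overrun_margin:
        return "suppress"       # planned downtime: non-reachability alert is switched off
    return "overrun_alert"      # downtime exceeded the expectation by more than the margin
```

With a two-hour window and a 30-minute margin, an unreachable device stays silent until two and a half hours after the start; beyond that point the overrun alert is raised.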
Each operating procedure in the data center should have a policy behind it to help keep the environment maintained and managed. Deviations from acceptable range should be automatically detected for corrective action and where possible prevent a violation. Besides helping to avoid data center failures, automated policies help in better governance and driving efficiency improvements. With increased adoption of DCIM as operations, planning and management software for data centers, Standard Operating Procedures (a la Policies) must form the core of an effective DCIM.
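As one illustration of detecting a deviation and, where possible, preventing it, the rack power rule from the Rack Load Policy could be sketched roughly as follows. The class and function names are hypothetical, and the figures (8 kW allocation, 7.5 kW load, 900 W server) are the ones used in the example above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rack:
    name: str
    allocated_kw: float  # power budget allocated to the rack
    load_kw: float       # power currently drawn by installed devices


def review_placement_request(rack: Rack, device_kw: float) -> str:
    """Workflow stage: reject a request that would exceed the rack's allocation."""
    return "reject" if rack.load_kw + device_kw > rack.allocated_kw else "approve"


def place_device(rack: Rack, device_kw: float, operator_confirms: bool) -> List[str]:
    """Configuration stage: warn first, then alert if the operator proceeds anyway."""
    events: List[str] = []
    if rack.load_kw + device_kw > rack.allocated_kw:
        events.append("on_screen_warning")
        if not operator_confirms:
            return events            # operator backed off; nothing is placed
    rack.load_kw += device_kw
    if rack.load_kw > rack.allocated_kw:
        events.append("critical_alert")  # to be routed as per the escalation policy
    return events
```

The same three-step pattern (reject in the workflow, warn on screen, alert on breach) would apply to weight, U-space and ownership checks, with a different quantity compared against its limit.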
To learn more about DCIM Policies, please read the whitepaper…
The “DCIM Policies: Automating Data Center Standard Operating Procedures” whitepaper outlines the importance of automating data center standard operating procedures, and how these policies help to avoid data center failures, help in better governance and driving efficiency improvements. Download Now.
In my last blog on DCIM Policies (Part 1), we discussed “Risk Management” and the various “DCIM Policies” that fall under this category. This blog (Part 2) moves on to the next category, “Governance”.
Streamlined governance with a chain of command, a system of checks and balances, and audit trails are a few of the universal best practices any organization adopts to meet voluntary or statutory compliance measures. This applies to Data Centers as well. The policies that we will cover under Governance are Security Policy, Data Retention Policy, Approval Policy and SLA Policy.
1. Security Policy: Includes role-based access. Wherever possible, we must use auditing in our environment. This helps keep track of commands run on these systems and their resulting impact. Similarly, we should avoid shared or generic accounts like "Administrator" where we can; commands should be linked to individual accounts (preferably privileged accounts used only for this sort of work, with a limited account used for everything else).
2. Data Retention Policy: This is an organization's established protocol for retaining information for operational or regulatory compliance needs. Data management and retention is a major growth area in both cost and energy consumption within the data center. It is generally recognized that a significant proportion of the data stored is either unnecessary or duplicated. Particular care should be taken to understand the impact of any data retention requirements. There are essentially three main objectives in developing a data retention policy, which can be summarized as follows:
a) To keep important records and documents for future use or reference;
b) To dispose of records or documents that are no longer needed; and
c) To organize records so they can be searched and accessed at a later date.
DCIM must provide users with secondary data storage areas which are identified by the retention policy and level of data protection. Non-editable archiving to secondary storage and purging must be automated in the data retention policy, which should also include Workflow approvals and Move-Add-Changes. Archived data also presents substantial opportunities for cost and energy savings.
3. Approval Policy for Provisioning and MACs: Provisioning of power, space, cooling and network ports when adding more customers, applications and IT devices can be contentious, as there are conflicting demands on finite amounts of these resources. An approval process with linkages to Power and Network Chains ensures that no section has been over-provisioned or under-provisioned in a way that can lead to a power or network trip. A similar situation arises with Move-Add-Change (MAC) operations: an approval process ensures that everyone knows about, agrees upon, and supports the proposed change(s). The changes and the associated approvals should be retained per the data retention policy so that one can trace back events as well as analyze whether any change resulted in an improvement or otherwise.
4. SLA Policy: An SLA in a data center contract serves 3 main purposes:
· Establishes specific levels of availability that are guaranteed by the data center.
· Sets communication protocol for any issues or uptime-impacting events that may arise.
· Lays out policies and procedures revolving around planned maintenance events by the data center (timing of such events, the communication procedures, etc.)
These agreements typically contain numerous measurable components that all revolve around meeting these key objectives. Automatic Alerts to customers have to be generated depending on allowed variance on each SLA component, which may be measured on daily, weekly, monthly or quarterly/annual basis.
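To make the idea of an allowed variance concrete, here is a small Python sketch for one SLA component, availability. The function names and the example target are assumptions for illustration only; real SLA terms vary from contract to contract.

```python
def availability_pct(period_minutes: float, downtime_minutes: float) -> float:
    """Measured availability over a reporting period, as a percentage."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def sla_breached(measured_pct: float, target_pct: float, allowed_variance_pct: float) -> bool:
    """True when the measured value falls below the target by more than the allowed variance."""
    return measured_pct < target_pct - allowed_variance_pct


# Example: a 30-day month (43,200 minutes) with 43.2 minutes of downtime.
monthly = availability_pct(43_200, 43.2)
needs_customer_alert = sla_breached(monthly, target_pct=99.95, allowed_variance_pct=0.0)
```

The same check can be run per day, week, month, quarter or year by changing the period over which downtime is accumulated.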
Come back next week for Part 3 on the trending topic of automated DCIM Policies and how they help drive efficiency improvements within data centers.
Recent high-profile data center outages have again brought to the fore that, while a lot of equipment and facility investment has gone into redundancy and disaster recovery, there is still a high reliance on manual operations. Surveys have indicated that human error ranks as the second highest causal factor in data center outages. This in turn has been attributed to failure to adhere to standard operating procedures (SOPs), which are usually well defined but forgotten or, worse, never communicated to operating staff.
There are several pitfalls in keeping the SOP as a manual and not automating the procedures. The logical home for automated procedures is the DCIM (Data Center Infrastructure Management), which is essentially Operations, Planning and Management software for a data center. This set of operational procedures is packaged into a “DCIM Policies” framework that links into different modules of the DCIM software, so that the DCIM detects any potential violation and sends alerts.
There are 12 key operating procedures that should be part of “DCIM Policies”. These policies broadly fall under three major categories: Risk Management, Governance and Efficiency Management. I have written this blog in 3 parts. This is Part 1, which discusses the first category, Risk Management, and the DCIM policies that fall under it.
I. Risk Management: This tries to mitigate a Data Center Manager’s nightmare of an unplanned downtime, or worse an extended outage that disrupts business application availability, causes massive financial loss and damages an organization’s reputation. The policies that fall under Risk Management are Alarm Policy, Escalation Policy, Redundancy Policy and Disaster Recovery Policy.
1. Alarm Policy: This helps to decide which devices and parameters need what frequency of monitoring, and to define their threshold levels in the system. Consider the expected operating temperature and humidity range as an example. Ideally, we should define operating temperature and humidity ranges at the device level, at the rack level, at the row level (for each hot and cold aisle), and at the room level (for the general comfort of operating staff). This is a high-priority decision factor under the DCIM alarm policy to prevent smoke, fire or damage to devices.
2. Escalation Policy: It is important to establish a clear-cut escalation process that defines how and when alerts should be escalated in a data center. The escalation policy needs to be developed and rehearsed to ensure the chain of command is informed and the appropriate resources are brought to bear as any situation develops. An escalation table can be defined in DCIM, outlining the protocol, the channels for escalating issues and the contact personnel with the appropriate expertise.
3. Redundancy Policy: This should be defined in DCIM according to the customer’s needs, i.e. whether to have an N+1 or N+2 configuration. It is not just redundant components that matter but also the process to test them and make sure they work reliably, such as scheduled failover drills and research into new methodologies. If we cannot have two, we need to figure out how to cobble together a replacement system if the primary equipment becomes unavailable or fails.
4. Disaster Recovery Policy: It is crucial to have a disaster recovery plan in place, with RTO (Recovery Time Objective) and RPO (Recovery Point Objective) well defined in the SLA. A data center disaster is a situation in which none of the redundancy options are available: a complete power outage, for example, is a disaster. In such a situation, how quickly can we recover to get at least one section of the data center up and running (RTO)? And how far back in time is the last point we can recover to, that is, how much recent data can we afford to lose (RPO)?
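The escalation table mentioned under the Escalation Policy could be represented along the following lines. This is only a sketch: the levels, time limits, contacts and channels are invented for illustration, and a real table would come from the organization's own chain of command.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EscalationLevel:
    after_minutes: int  # minutes an alert may stay unacknowledged before this level is notified
    contact: str
    channel: str


# Hypothetical three-level table, ordered by increasing time limit.
ESCALATION_TABLE = [
    EscalationLevel(0, "duty operator", "email"),
    EscalationLevel(15, "shift supervisor", "sms"),
    EscalationLevel(45, "data center manager", "phone"),
]


def levels_to_notify(table: List[EscalationLevel], minutes_unacknowledged: int) -> List[EscalationLevel]:
    """Return every level whose time limit has already been reached."""
    return [level for level in table if minutes_unacknowledged >= level.after_minutes]
```

An alert that has been ignored for 20 minutes would, under this table, have reached both the duty operator and the shift supervisor, but not yet the data center manager.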
Every operating procedure in the data center should have a policy behind it to help keep the environment maintained and managed. Deviations from the acceptable range should be automatically detected for immediate corrective action and, where possible, a violation should even be prevented. Besides helping to avoid data center failures, automated policies help in better governance and in driving efficiency improvements. In my next blog I will cover streamlined governance and best practices that apply to data centers. For more information on how to derive benefits from DCIM policy-based systems, download the GreenField Software white paper “DCIM Policies: Automating Data Center Standard Operating Procedures”.
Data Center Infrastructure Management (DCIM) software has come out of the shadows. A decade ago it was an emerging technology, a tool integrating a Building Management System (BMS) with systems management software. It has since matured into Data Center Operations, Planning and Management software vital for data center staff, managers and the CIO. Of course, evolution will continue with Software Defined Data Centers (SDDCs), and with DCIM becoming a natural inheritor of an Industrial Internet of Things (IoT) application.
Beyond monitoring and sending critical alerts, DCIM addresses the common data center operation problems that happen due to the absence of a comprehensive, up-to-date and accurate asset database. A Move-Add-Change operation goes untracked in a spreadsheet. DCIM automates this function with workflow-based approvals. The asset database itself can be created through an auto-discovery process and populating static attributes from in-built OEM libraries. Please click here to see this at work.
Beyond Data Center Operations, DCIM helps in Planning and Management. It helps identify the best-fit racks for a new server, based on available space and remaining allocated power. It provides alerts when a Preventive Maintenance date is due, so that downtime can be properly planned, especially when there is a possibility of downstream cascading impact. A Power Chain provides ready visualization to help avoid an adverse situation.
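A best-fit rack search of this kind can be sketched in a few lines of Python. This is not how any specific DCIM product implements it; the `RackCapacity` fields and the tie-breaking rule (least power left over, then least space left over) are assumptions, and the sketch ignores whether the free U-spaces are contiguous.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RackCapacity:
    name: str
    free_u: int          # unoccupied U-spaces available for new devices
    remaining_kw: float  # allocated power not yet consumed


def best_fit_rack(racks: List[RackCapacity], needed_u: int, needed_kw: float) -> Optional[RackCapacity]:
    """Pick the rack that fits the server while leaving the least spare capacity."""
    candidates = [r for r in racks if r.free_u >= needed_u and r.remaining_kw >= needed_kw]
    if not candidates:
        return None
    # Prefer the tightest power fit, then the tightest space fit.
    return min(candidates, key=lambda r: (r.remaining_kw - needed_kw, r.free_u - needed_u))


racks = [
    RackCapacity("A", free_u=10, remaining_kw=0.5),
    RackCapacity("B", free_u=4, remaining_kw=1.2),
    RackCapacity("C", free_u=20, remaining_kw=3.0),
]
choice = best_fit_rack(racks, needed_u=2, needed_kw=0.9)
```

Here rack A is excluded for lack of power, and B is preferred over C because it leaves less stranded power capacity.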
Finally, DCIM provides Data Center metrics. On the dashboard one can get rack space utilization, PUE and rack, row & room temperature for the month as against the SLA terms.
In a world of always-on service delivery, data center failures are unthinkable. The financial implication and loss of reputation alone make it imperative that we put in place systems that prevent such failures. At the same time, CFOs and CIOs are grappling with increasing capital expenditures and operating costs involved in running a High Availability Data Center.
GFS Crane DCIM delivers on the High Availability promise as well as enabling a leaner, greener and operationally more efficient data center. For more on GFS Crane DCIM, please see our new brochure.
We are privileged to be again participating in Intel DCM Booth at DCD New York on 19th & 20th April, 2016 at Marriott Marquis, Times Square.
We will be announcing our new version GFS Crane DCIM v3.0 with a number of enhancements.
With these enhancements, we will show how GFS Crane is a complete DCIM for Enterprise as well as Colocation Data Centers. We will demonstrate how it helps in Data Center Operations, Planning & Management.
We will demonstrate the integration with Intel Data Center Manager and how that reduces cost of ownership for better capacity planning, power, space and cooling management.
Last month, we did a joint webinar with Intel DCM Group where we showed how GFS Crane mitigates risks of data center failures. Those who missed that may see the recorded version. If you are attending DCD New York, please visit us at the Intel DCM booth to see GFS Crane as a Policy-driven DCIM. Threshold levels for each monitored parameter and SLA terms are defined within the Policies section of GFS Crane DCIM. When a threshold level is breached Active Alerts are sent to designated users via email or text messages or sent as SNMP traps to an ITSM tool. Dashboards and analytic reports provide Data Center Managers key metrics against SLA terms such as average PUE for the month, temperature range and uptime standards for critical devices.
You can read more from a recent case study of GFS Crane DCIM with a telecom operator.
Our recent DCIM implementations provide an insight into actual usage patterns among enterprise DCIM customers in the South Asia region. While analyst reports, mostly focused on the North American and Western European markets, suggest Energy Efficiency, Capacity Planning and Compliance as the foremost reasons for DCIM deployment, here are our observations:
- While 80% of our enterprise data center customers have licensed the full GFS Crane DCIM suite, the principal (but not only) reason for their deployment was to prevent a data center outage.
- To prevent this outage, customers needed instant monitoring and alerts from all critical infrastructure.
- If the customer had a BMS, DCIM had to integrate with that.
- If the customer did not have a BMS, then DCIM had to integrate directly with the devices, and specifically with those they perceived as the MOST critical, or the weakest link in the chain.
- The other DCIM functions in order of importance were: management dashboards with KPIs, data center visualization, asset and change management (with workflow approvals and audit trails) and capacity planning.
The above usage patterns are for enterprise DCIM, as against DCIM in multi-tenant data centers, which of course have additional reasons for DCIM deployment such as automating the customer on-boarding process, capacity planning and power/space inventory management, energy billing and offering customer portals for self-service.
Most enterprise data centers in India have fewer than 50 racks, and a large proportion do not have a BMS or instrumentation for monitoring physical infrastructure. They rely on periodic manual monitoring, taking readings from device consoles, room thermometers and hand-held power meters. The inadequacy of this archaic approach is obvious to all. Hence the options are BMS, DCIM or a combination of both, the latter two when customers are looking beyond monitoring and sending alerts.
The weakest link with one customer, operating in a region with daily twelve-hour power outages, was the DG sets and fuel supply. Hence, GFS Crane DCIM had to offer a comprehensive fuel automation system, including 24x7 monitoring of DG sets and fuel tanks and control of fuel levels in the tanks.
With a High Performance Computing customer, paranoid about poor power quality or extended power outage damaging expensive equipment, GFS Crane DCIM provided extensive alerts as well as analytics, not just on individual UPS devices, but also on banks of them with DR policies defined within the DCIM. Passive alerts were converted to actionable instructions for preventing an application outage, and quickly isolating any expensive equipment from such power related incidents. Of course, both these customers are also benefiting from GFS Crane DCIM’s comprehensive asset & change management, capacity planning, power and environment management capabilities across both physical as well as IT infrastructure – the latter with Intel Data Center Manager.
I take this opportunity to wish all our customers, partners and visitors to our web site a Very Happy & Prosperous New Year.
GFS is a Knowledge Partner at DCD CONVERGED Bangalore taking place this week. We are unveiling here the integrated solution of GFS Crane DCIM with Intel Data Center Manager. DCD attendees can see this tomorrow (16th July) at the Intel booth.
GFS is sponsoring a Panel Discussion this afternoon “Key Metrics for Data Center Operations to Measure Reliability & Efficiency.” Moderated by our Vice President – Development, Parikshit Bhaduri, the panel includes
- Mr. N. Subramanyam, CIO, Diageo
- Mr. Rupinder Goel, Global CIO, Tata Communications
- Ms. Shaheen Meeran, Managing Director, Schnabel DC Consultants India
Data Center Metrics is a growing subject in data center operations management. Starting from the days when Power Usage Effectiveness (PUE) was first defined, Green Grid subsequently expanded with multi-level PUE measurements, Carbon Usage Effectiveness (CUE), and Water Usage Effectiveness (WUE) to round-off a set of sustainability metrics for the data center.
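For readers less familiar with these three metrics, the short Python sketch below shows how each is computed from annual totals, following the Green Grid definitions: all three divide by IT equipment energy. The function names and the sample figures are illustrative assumptions.

```python
def _check_it_energy(it_energy_kwh: float) -> None:
    if it_energy_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")


def pue(total_facility_kwh: float, it_energy_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    _check_it_energy(it_energy_kwh)
    return total_facility_kwh / it_energy_kwh


def cue(co2_kg: float, it_energy_kwh: float) -> float:
    """Carbon Usage Effectiveness: kg of CO2 emitted per kWh of IT equipment energy."""
    _check_it_energy(it_energy_kwh)
    return co2_kg / it_energy_kwh


def wue(water_litres: float, it_energy_kwh: float) -> float:
    """Water Usage Effectiveness: litres of water used per kWh of IT equipment energy."""
    _check_it_energy(it_energy_kwh)
    return water_litres / it_energy_kwh


# Hypothetical annual figures for a facility fed entirely from one grid source.
total_kwh, it_kwh = 1_800_000, 1_000_000
grid_kg_co2_per_kwh = 0.7  # assumed carbon emission factor of the supply
site_pue = pue(total_kwh, it_kwh)
site_cue = cue(grid_kg_co2_per_kwh * total_kwh, it_kwh)
```

When all energy comes from a single source, CUE works out to the emission factor multiplied by PUE, which is why an improvement in PUE also lowers CUE.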
Other organizations, notably eBay, defined other measurement criteria for different data center roles. This will be the topic for today’s GFS-sponsored panel. DCIM is expected to deliver all the key metrics for different data center roles in real time, not only to indicate the current state of health of the data center, but also to enable instant corrective actions (if required) to maintain uptime and SLAs.
An example of this comes from our collaboration with Intel. Integrating Intel Data Center Manager gives us interesting measures of Rack Utilization Effectiveness (RUE) with respect to power, space and temperature on a real-time basis. An instant view of RUEs helps in accurate Move-Add-Change (MAC) decisions as well as Capacity Planning and Forecasting.
In the first part of this blog on Business Analytics for Data Centers, we explored why Analytics has become critical for Data Center operations. In this second part, we will explore how DCIM fulfills this role as a Business Analytics tool for Data Center operations.
While DCIM in its early days was largely seen as a bridge between Facilities and the IT Infrastructure Groups, it is now being recognized as an analytic tool for data center operations. Maturity in DCIM technology has meant that huge amounts of data from different devices are captured on a real time basis. Data Center Managers rightly expect that DCIM must now be more than just a monitoring tool and deliver meaningful insights from the data lake of power and environment monitoring, server utilization and threshold breaches.
At configuration stage, DCIM is mapped with the critical relationships and dependencies between all the assets, applications and business units in the data center. This makes it possible to identify cascading impacts of an impending failure. DCIM Analytics however goes deeper. Over a period of time, data patterns emerge which lend themselves to modern predictive and prescriptive analytics. Predictive analytics gives the data center team enough time to take measures to either avoid or reduce the impact of the failure when it happens. Prescriptive analytics, on the other hand, provides suggestions on how to achieve or improve benchmark levels on each of the metrics specified in advance.
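Identifying cascading impact from mapped dependencies amounts to walking a dependency graph. The Python sketch below does this with a breadth-first traversal over a made-up power chain; the device names are hypothetical, and the sketch deliberately ignores redundancy (a rack fed from two sources would not actually go down when one of them fails).

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical power chain: each key feeds the devices listed as its values.
POWER_CHAIN: Dict[str, List[str]] = {
    "utility": ["ups-a"],
    "ups-a": ["pdu-1", "pdu-2"],
    "pdu-1": ["rack-01", "rack-02"],
    "pdu-2": ["rack-03"],
}


def downstream_impact(chain: Dict[str, List[str]], device: str) -> Set[str]:
    """Return every device that directly or indirectly depends on `device`."""
    impacted: Set[str] = set()
    queue = deque(chain.get(device, []))
    while queue:
        node = queue.popleft()
        if node in impacted:
            continue              # already visited via another path
        impacted.add(node)
        queue.extend(chain.get(node, []))
    return impacted
```

Calling `downstream_impact(POWER_CHAIN, "pdu-1")` lists the racks that would be affected if that PDU failed or was taken down for maintenance. The same traversal could be applied to a network chain.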
DCIM works with environment probes that measure rack, row and room temperatures and humidity levels. Analytics can help to determine which areas in the data center need more cooling than others and even which PAC unit may be turned off in the data center at certain times of the day or month. Advanced DCIM, through analytics, recommends ways to reduce power consumption in the data center by raising temperature in zones that do not need extra cooling.
Other Benefits Using DCIM
Move-Add-Change (MAC) operations are frequent in data centers. DCIM has the capacity to deal with these MACs, as well as with sudden surges in demand for data center resources. This works especially well with multiple virtual servers in the cloud. Most businesses today do not own just one data center housed in a single location; their data centers are spread around the world. Some are in-house and others are hosted by third parties. DCIM is the only technology that lets business users control all their data center assets and resources from a single platform.
Data centers are notorious for their high power consumption. Advanced DCIM provides business and operational intelligence to maximize rack space use, minimize power distribution losses and optimize cooling while ensuring the data center meets SLA standards for temperature, availability and energy efficiency metrics like PUE (Power Usage Effectiveness).
Most businesses are finding it hard to make the most of the existing space in their data centers, and the use of DCIM software mitigates this problem to a great extent. DCIM can help reduce the rack and floor space in use by providing detailed real-time reports on server utilization and capacity. Server utilization reports suggest which servers can be decommissioned or virtualized, thereby easing space constraints in the data center.
Finally, the most important function of DCIM is to prevent data center failures which can permanently damage the reputation of a business. In an age when a major data center failure can prove fatal for a business, DCIM provides monitoring as well as predictive analytic capability to prevent such a disaster.
A new breed of Data Center Infrastructure Management (DCIM) software is now emerging out of the shadows of being just a monitoring and tracking tool. Advanced DCIM are now providing the much needed Business Analytics for data centers. This can be a boon for both C-level executives and data center managers looking to cut costs while meeting demands for High Availability. This is a two-part blog. In this first part, we explore why Analytics has become critical for Data Center operations in the new world order of Internet of Things.
As data centers are growing in complexity, the need to keep them functioning at an optimum level, while cutting down on costs, is a challenge facing both the CIO as well as the CFO. Large businesses are spending millions to keep their data centers up and running and it is directly affecting their bottom line and ROI. Companies can no longer afford to let their data centers run under-utilized, nor can they afford failures. Sadly, most organizations are struggling to make the most of their data center investments.
Business Analytics and DCIM – An Introduction
Business Analytics software provides a broad set of capabilities for gathering and processing business data, and includes functions such as reporting, analysis, modeling and forecasting - all of which give business users the ability to make informed decisions and initiate actions directly from their dashboards.
In order to understand how a few of the advanced Data Center Infrastructure Management (DCIM) software products provide similar capabilities for data centers, we first have to look at the challenge of running data centers effectively and at minimal cost. While the foremost responsibility of the Data Center Manager is maintaining High Availability, the challenge, somewhat ironically, can be summed up in one sentence:
Extreme redundancies with lots of assets increase the vulnerable points!
Not to mention, they also consume large amounts of resources, and typically remain under-utilized.
Data center assets comprise both physical as well as IT infrastructure. The resources to keep them running include space and networks and also power and cooling without which the assets would not be able to function.
Advanced DCIM gives data center operators the ability to manage all their data center assets and resources from a single dashboard. Through real-time monitoring of all assets and resources, they can determine correlations between different parameters, thereby making their DCIM a powerful platform for deep analytics and business intelligence. DCIM Analytics ensures that all data center assets are in good health while consuming the least amount of resources and provides complete visibility to the power chain, enabling tracking and eliminating potential vulnerable points of failures.
In the second part, we will explore "How DCIM Business Analytics Works."
As we approach the end of another year, it is time to reflect on what is in store for us in the Data Center world in 2015 and beyond.
Up until now, from the dot com boom days, we witnessed first the proliferation and later consolidation of data centers.
Now that is about to change. Gartner recently pointed out that a new disruption is about to happen in the data center world. Growth of the Internet of Things (IoT) would churn out hitherto unseen volumes of data every minute from zillions of Internet-enabled wearables, devices and industrial equipment all around the world. This data would have to be analyzed instantaneously for operational intelligence, predictive and prescriptive analytics and even national security. This would have a major impact on data center operations.
The current paradigm of centralized operations means transmitting data over thousands of miles from point of data origination to central servers and then sending back to originating point and elsewhere, again over thousands of miles, and a few times over. This would require massive bandwidths and massive computing power. Instead we will see an emergence of a “glocal” model, with distributed data centers serving immediate local needs and then transmitting raw as well as processed data to centralized processing centers for deeper analysis for global consumption.
When this happens, it will bring about an overhaul of business processes and have a transformational effect on data center operations. Data Center operations will look like an intricate supply chain network, involving business partners, transfer pricing, and running transportation and cost optimization models. Data Center operations will be software-driven, even transcending what is currently defined within the boundaries of a Software Defined Data Center (SDDC), which is primarily about virtualization of computing resources. Data Center Infrastructure Management (DCIM) will be at the heart of this transformation, with a workflow-driven Business Process Management (BPM) layer that would auto-enable data traffic via a set of algorithms working on constraint-based planning.
The use cases of the Internet of Things, from consumer efficiency to health care to transportation and even agriculture, are fascinating. What is less written about is the profound impact this will have at the back end, on the Data Centers that would have to support this new era of a ubiquitous connected world.