Our last two blogs on DCIM Policies discussed “Risk Management” and “Governance.” Risk Management covered Alarm, Escalation, Redundancy and Disaster Recovery Policies. Governance covered Security, Data Retention, Approval and SLA Policies.
This last part will cover “Efficiency Management”: a set of critical KPIs that form the core of a Data Center Manager’s Handbook. The Green Grid, ASHRAE and the Uptime Institute have defined a number of KPIs for an energy- and operationally-efficient data center. Typically, these KPIs appear on the DCIM dashboard. Four policies are covered in this part: PUE Policy, Rack Load Policy, Replacement Policy and Preventive Maintenance Policy.
1. PUE Policy: The Power Usage Effectiveness (PUE) metric is an industry standard for reporting the energy performance of data centers. Organizations need to take several measures to ensure a better PUE. PUE policies in DCIM would be as follows:
a) PUE range values: A data center may define a maximum acceptable annualized average PUE, adjusted for external temperature conditions, with alerts sent accordingly. Newer data centers (or those where DCIM has been recently implemented) that do not yet have a full year of PUE values can maintain daily/weekly/monthly/quarterly averages instead.
b) UPS load: Matching UPS capacity to the system load improves PUE. If a UPS is loaded to only 30% of capacity, its efficiency will be much lower. Hence, we may define a lower threshold of UPS load that should generate an alert. An upper load level must also be defined to maintain the balance of power load across the connected downstream devices.
c) Carbon Usage Effectiveness (CUE): The Green Grid, authors of PUE, has also defined another metric, CUE, which is derived from PUE. Sustainability-conscious organizations maintain CUE as an additional metric and may ask for it to be included for generating alerts as well.
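As a sketch, a combined PUE/CUE policy check might look like the following. The threshold values and the grid carbon emission factor (CEF) here are hypothetical; the CUE formula (CUE = CEF × PUE) follows The Green Grid's definition for a facility drawing all power from one source.

```python
def check_pue_policy(avg_pue, max_pue=1.8, cef=0.82, max_cue=1.6):
    """Return alert messages for any PUE/CUE threshold breaches.

    cef is the carbon emission factor of the energy source in kg CO2
    per kWh; all limits here are illustrative, not industry mandates.
    """
    alerts = []
    if avg_pue > max_pue:
        alerts.append(f"PUE alert: {avg_pue:.2f} exceeds limit {max_pue:.2f}")
    cue = cef * avg_pue  # kg CO2 per kWh of IT energy
    if cue > max_cue:
        alerts.append(f"CUE alert: {cue:.2f} exceeds limit {max_cue:.2f}")
    return alerts

print(check_pue_policy(2.1))  # breaches both limits with these defaults
```

In a DCIM, the `avg_pue` input would be the rolling daily/weekly/monthly average described in (a), and the alerts would feed the escalation policy.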
2. Rack Load Policy: A data center must have a proper rack load policy in place in terms of power load, temperature, weight, U-space and ownership allocation. Threshold or procedure breaches in rack loads need to generate on-screen warnings or alerts.
a) Rack Power: Each rack is allocated a power load, say 8 kW. If a rack is already loaded with devices drawing up to 7.5 kW, a workflow approval request proposing that rack as the location for a 900 W server should first be rejected. If the operator still attempts to configure the DCIM with this server, an on-screen warning would be displayed. If the operator nevertheless places the server and the rack load jumps beyond 8 kW, a critical alert would be sent immediately as per the escalation policy.
b) Rack Temperature: Rack temperatures are defined under alarm settings. If temperatures exceed thresholds, alerts would be sent.
c) Rack Weight: Depending on floor load bearing capacity, a certain weight capacity is allocated for each Rack. Alerts can be configured accordingly.
d) Rack U-space: Typically, some U-spaces in a rack are kept free, and this reserve should be defined. If not an alert, at least an on-screen warning should appear when an operator commits this procedure breach.
e) Rack Ownership: Racks, or even individual U-spaces, may be allocated to a business owner. Placing a device belonging to a different owner there should generate a warning or alert.
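The rack-load checks above can be sketched as a single placement validator. The field names, rack record and limits below are illustrative, not a specific DCIM schema:

```python
def validate_placement(rack, device):
    """Return the list of policy breaches for placing device in rack."""
    issues = []
    if rack["used_kw"] + device["kw"] > rack["max_kw"]:
        issues.append("power budget exceeded")
    if rack["used_kg"] + device["kg"] > rack["max_kg"]:
        issues.append("weight limit exceeded")
    if rack["free_u"] - device["u"] < rack["reserved_u"]:
        issues.append("reserved U-space would be consumed")
    if device["owner"] != rack["owner"]:
        issues.append("ownership mismatch")
    return issues

# The 8 kW rack from (a), already loaded to 7.5 kW, offered a 900 W server:
rack = {"max_kw": 8.0, "used_kw": 7.5, "max_kg": 900, "used_kg": 600,
        "free_u": 6, "reserved_u": 2, "owner": "Finance"}
server = {"kw": 0.9, "kg": 25, "u": 2, "owner": "Finance"}
print(validate_placement(rack, server))  # flags the power breach
```

An empty list would mean the workflow approval can proceed; a non-empty list would trigger the rejection, warning or alert described above.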
3. Replacement Policy: In this policy, we define the expected life for each category of device in the Data Center.
a) Alerts can be configured when a device is nearing end of life. This helps in decommissioning planning.
b) Alerts can also be set up ahead of the actual replacement so that affected users can make contingency plans should something go wrong during the transition.
4. Preventive Maintenance Schedule Policy: As a common practice, most changes in a data center are planned during non-critical periods. Preventive maintenance and upgrade schedules with expected downtimes can be defined in DCIM. The following can then be configured:
a) Switching off non-reachability alerts during the planned downtime
b) Sending an alert if actual downtime exceeds expected downtime by a defined margin
c) Validating from the Power and Network Chains that the scheduled preventive maintenance of a device does not have a cascading impact; if it does, an alert would be generated
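Items (a) and (b) can be sketched as a single decision function. The window times and overrun margin are hypothetical:

```python
from datetime import datetime, timedelta

def should_alert(event_time, window_start, expected_downtime, margin):
    """Decide whether a non-reachability event should raise an alert."""
    window_end = window_start + expected_downtime
    if window_start <= event_time <= window_end:
        return False        # planned downtime: suppress the alert
    if event_time > window_end:
        # past the window: alert only once the overrun exceeds the margin
        return event_time > window_end + margin
    return True             # before the window: normal alerting applies

start = datetime(2015, 1, 10, 2, 0)  # 2 a.m. maintenance window
print(should_alert(start + timedelta(hours=1), start,
                   timedelta(hours=2), timedelta(minutes=30)))  # False
print(should_alert(start + timedelta(hours=3), start,
                   timedelta(hours=2), timedelta(minutes=30)))  # True
```

The margin gives the maintenance team a grace period before the overrun escalates per the alarm and escalation policies.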
Each operating procedure in the data center should have a policy behind it to help keep the environment maintained and managed. Deviations from the acceptable range should be detected automatically for corrective action and, where possible, violations should be prevented outright. Besides helping to avoid data center failures, automated policies enable better governance and drive efficiency improvements. With the increased adoption of DCIM as operations, planning and management software for data centers, Standard Operating Procedures (i.e., Policies) must form the core of an effective DCIM.
To learn more about DCIM Policies, please read the whitepaper…
The “DCIM Policies: Automating Data Center Standard Operating Procedures” whitepaper outlines the importance of automating data center standard operating procedures, and how these policies help avoid data center failures, enable better governance and drive efficiency improvements. Download Now.
Data Center Infrastructure Management (DCIM) software has now come out of the shadows of the emerging technology it started as a decade ago: a tool integrating a Building Management System (BMS) with systems management software. It has matured into Data Center Operations, Planning and Management software vital for data center staff and managers as well as the CIO. Of course, the evolution will continue, with Software Defined Data Centers (SDDCs) and DCIM becoming a natural inheritor of Industrial Internet of Things (IoT) applications.
Beyond monitoring and sending critical alerts, DCIM addresses the common data center operational problems that arise in the absence of a comprehensive, up-to-date and accurate asset database. A Move-Add-Change operation easily goes untracked in a spreadsheet; DCIM automates this function with workflow-based approvals. The asset database itself can be created through an auto-discovery process, with static attributes populated from in-built OEM libraries. Please click here to see this at work.
Beyond Data Center Operations, DCIM helps in Planning and Management. It helps identify the best-fit racks in which to place a new server, based on available space and remaining allocated power. It provides alerts when a preventive maintenance date is due, so downtime can be properly planned, especially when there is a possibility of downstream cascading impact. A Power Chain view provides ready visualization to help avoid an adverse situation.
Finally, DCIM provides data center metrics. On the dashboard, one can see rack space utilization, PUE, and rack, row and room temperatures for the month, measured against the SLA terms.
In a world of always-on service delivery, data center failures are unthinkable. The financial implication and loss of reputation alone make it imperative that we put in place systems that prevent such failures. At the same time, CFOs and CIOs are grappling with increasing capital expenditures and operating costs involved in running a High Availability Data Center.
GFS Crane DCIM delivers on the High Availability promise while enabling a leaner, greener and operationally more efficient data center. For more on GFS Crane DCIM, please see our new brochure.
Our recent DCIM implementations provide insight into actual usage patterns among enterprise DCIM customers in the South Asia region. While analyst reports, mostly focused on the North American and Western European markets, suggest energy efficiency, capacity planning and compliance considerations as the foremost reasons for DCIM deployment, here are our observations:
- While 80% of our enterprise data center customers have licensed the full GFS Crane DCIM suite, the principal (but not only) reason for their deployment was to prevent a data center outage.
- To prevent such an outage, customers needed real-time monitoring of, and alerts from, all critical infrastructure.
- If a customer had a BMS, DCIM had to integrate with it.
- If a customer did not have a BMS, DCIM had to integrate directly with the devices, specifically those perceived as the MOST critical, or the weakest link in the chain.
- The other DCIM functions in order of importance were: management dashboards with KPIs, data center visualization, asset and change management (with workflow approvals and audit trails) and capacity planning.
The above usage patterns are for enterprise DCIM, as against DCIM in multi-tenant data centers, which of course have additional reasons for deployment, such as automating the customer on-boarding process, capacity planning and power/space inventory management, energy billing, and offering customer portals for self-service.
Most enterprise data centers in India have fewer than 50 racks, and a large proportion have no BMS or instrumentation for monitoring physical infrastructure. They rely on periodic manual monitoring, taking readings from device consoles, room thermometers and hand-held power meters. The inadequacy of this archaic approach is obvious to all. Hence, the options are BMS, DCIM or a combination of both, the latter two when customers are looking beyond monitoring and alerting.
The weakest link for one customer, operating in a region with daily twelve-hour power outages, was its DG sets and fuel supply. Hence, GFS Crane DCIM had to offer a comprehensive fuel automation system, including 24x7 monitoring of DG sets and fuel tanks and controlling fuel levels in the tanks.
For a High Performance Computing customer, paranoid about poor power quality or an extended power outage damaging expensive equipment, GFS Crane DCIM provided extensive alerts as well as analytics, not just on individual UPS devices but also on banks of them, with DR policies defined within the DCIM. Passive alerts were converted into actionable instructions for preventing an application outage and quickly isolating expensive equipment from such power-related incidents. Of course, both these customers also benefit from GFS Crane DCIM’s comprehensive asset and change management, capacity planning, and power and environment management capabilities across both physical and IT infrastructure – the latter with Intel Data Center Manager.
I take this opportunity to wish all our customers, partners and visitors to our web site a Very Happy & Prosperous New Year.
Most of us are familiar with the quote that “if you cannot measure, you cannot manage”. In every field, spanning technology and management, a set of metrics is established to measure performance against stated objectives. The metrics should tell the stakeholders how the system is performing. Metrics for a business can come from several different perspectives: financial, customer satisfaction, environmental impact, etc. Just one aspect, such as financials, does not tell the whole story. If the board of a company looks only at the financial aspect, ignoring other areas, it may be myopic. Today a company may be doing fine on financial metrics such as EPS, revenue and profitability. However, if its customer satisfaction index is poor and its brand value suffers from environmental impact, it does not augur well for the company. Similarly, a data center needs to be viewed from different angles – cost efficiency, power consumption, reliability, customer satisfaction – to make the measurement well-rounded.
PUE – is that the only metric needed in a data center?
PUE – Power Usage Effectiveness – is the best known of all data center metrics. At the core of the data center are the computing units – servers, storage and switches – which run the applications, store the data, and communicate internally and externally. One of the primary costs of running a data center is the power consumed. Power consumption has two components: power consumed by the computing units and power consumed by the rest of the facilities equipment, such as cooling. PUE is calculated by dividing the total power consumed by the data center by the power consumed by the computing units. The lower the PUE, the more efficient the data center. If the PUE of a data center is 2, it means 50% of the power is used by the computing units. Now, if we can bring down the total power while the power drawn by the computing units remains the same, we have increased efficiency by reducing the overhead of functions such as cooling.
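The arithmetic in the paragraph above can be stated in a couple of lines:

```python
def pue(total_kw, it_kw):
    """PUE = total facility power / power of the computing (IT) units."""
    return total_kw / it_kw

# A facility drawing 500 kW in total, with 250 kW going to computing units:
print(pue(500, 250))  # 2.0, i.e. the computing units use 50% of the power
```

Cutting total draw to 450 kW at the same IT load would give `pue(450, 250)`, or 1.8: the same work done with less cooling and distribution overhead.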
The importance of PUE cannot be denied, and every data center should strive to get it as close to 1 as possible. However, PUE is not the only metric; data centers have to consider several others. Furthermore, PUE can be deceptive. For example, if one replaces the computing units with units that consume less power, the total power drawn will fall but the PUE will increase. For similar reasons, PUE cannot be used to compare data centers. If a data center runs mostly on renewable energy, its environmental impact is marginal even though its PUE may be slightly worse than that of comparable data centers running on conventional energy.
Reliability and availability
A data center not only needs to be efficient from a cost and power perspective, it also needs to be reliable and available, considering that most data centers run business-critical applications as more and more applications are hosted in the cloud. No customer will tolerate even partial downtime, let alone an outage of the whole data center. Hence, the metrics that measure reliability and availability are important. Asset availability metrics such as MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair) should be measured. Another measure of reliability is the number and category of alarms raised in the data center, and how quickly those alarms are responded to.
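For illustration, steady-state availability follows from MTBF and MTTR via the standard formula A = MTBF / (MTBF + MTTR); the figures below are hypothetical:

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the asset is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A device failing every 10,000 hours on average, repaired in 4 hours:
print(f"{availability(10000, 4):.5f}")  # ~0.99960, i.e. "three nines"-plus
```

Tracking MTBF and MTTR per asset class lets the dashboard report availability against the SLA rather than just counting outages.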
A data center needs to be customer-centric; gone are the days when a data center ran outside the glare of the core business. Today it is intimately connected with the business, whether it is a captive data center or one providing facilities for others. A captive data center runs the core business of different LOBs and needs to respond to their needs. A data center providing colocation and hosting services has to be customer-centric in its operations: it has to ensure customer provisioning requests are satisfied and every customer ticket is closed within the SLA. So for data centers, captive or otherwise, compliance with SLAs is extremely important, and it can be measured by the provisioning requests or service tickets that fall outside the SLA – the percentage not meeting the SLA. Closely tied to customer satisfaction is the capacity of a data center. As long as the data center has sufficient capacity in terms of power, cooling and resources, it will be able to service provisioning requests quickly. Hence, measuring capacity at all times is paramount.
I recently hosted a panel discussion on data center metrics, and the panelists pretty much concluded that metrics are extremely important for data center operations and need to be viewed across the different areas outlined above. Also, with the availability of DCIM software from companies such as Greenfield, it is easy to capture and view these metrics in real time. Greenfield’s GFS Crane software provides a dashboard with key metrics such as PUE, availability and capacity utilization. In addition, one can drill down into reports for a granular view. With the automation provided by software such as GFS Crane, it is easy to stay on top of things and react with agility as situations change, or take proactive steps wherever possible.
In the first part of this blog on Business Analytics for Data Centers, we explored why analytics has become critical for data center operations. In this second part, we will explore how DCIM fulfills the role of a business analytics tool for data center operations.
While DCIM in its early days was largely seen as a bridge between the Facilities and IT Infrastructure groups, it is now being recognized as an analytics tool for data center operations. Maturity in DCIM technology means that huge amounts of data from different devices are captured in real time. Data Center Managers rightly expect that DCIM must now be more than just a monitoring tool and deliver meaningful insights from the data lake of power and environment monitoring, server utilization and threshold breaches.
At configuration stage, DCIM is mapped with the critical relationships and dependencies between all the assets, applications and business units in the data center. This makes it possible to identify the cascading impacts of an impending failure. DCIM analytics, however, goes deeper. Over a period of time, data patterns emerge that lend themselves to modern predictive and prescriptive analytics. Predictive analytics gives the data center team enough time to take measures to either avoid a failure or reduce its impact when it happens. Prescriptive analytics, on the other hand, provides suggestions on how to achieve or improve benchmark levels on each of the metrics specified in advance.
DCIM works with environment probes that measure rack, row and room temperatures and humidity levels. Analytics can help to determine which areas in the data center need more cooling than others and even which PAC unit may be turned off in the data center at certain times of the day or month. Advanced DCIM, through analytics, recommends ways to reduce power consumption in the data center by raising temperature in zones that do not need extra cooling.
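As an illustration of this kind of analysis, zones with excess cooling headroom can be flagged from rack temperature readings. The sample data, zone names and the 3 °C headroom rule below are all assumptions, not output from a real DCIM:

```python
from statistics import mean

# Hypothetical rack temperature readings (Celsius), grouped by cooling zone
readings = {
    "zone-A": [21.0, 21.5, 20.8],
    "zone-B": [26.5, 27.1, 26.9],
}
MAX_ALLOWED_C = 27.0  # e.g. top of the allowed operating range

for zone, temps in readings.items():
    headroom = MAX_ALLOWED_C - mean(temps)
    if headroom > 3.0:  # well below the limit: cooling can likely be reduced
        print(f"{zone}: {headroom:.1f} C headroom - consider reducing cooling")
```

Here zone-A is over-cooled and is a candidate for a higher setpoint or for turning off a PAC unit, while zone-B is already near its limit.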
Other Benefits of Using DCIM
Move-Add-Change (MAC) operations are frequent in data centers. DCIM has the capacity to deal with these MACs, as well as with sudden surges in demand for data center resources. This works especially well with multiple virtual servers in the cloud. Most businesses today do not own just one data center in a single location – their data centers are spread around the world, some in-house and others hosted by third parties. DCIM is the only technology that lets business users control all their data center assets and resources from a single platform.
Data centers are notorious for their high power consumption. Advanced DCIM provides business and operational intelligence to maximize rack space use, minimize power distribution losses and optimize cooling while ensuring the data center meets SLA standards for temperature, availability and energy efficiency metrics like PUE (Power Usage Effectiveness).
Most businesses find it hard to make the most of the existing space in their data centers, and DCIM software mitigates this problem to a great extent. DCIM can help improve rack and floor space utilization by providing detailed real-time reports on server utilization and capacity. Server utilization reports suggest which servers can be decommissioned or virtualized, thereby overcoming space constraints in the data center.
Finally, the most important function of DCIM is to prevent data center failures which can permanently damage the reputation of a business. In an age when a major data center failure can prove fatal for a business, DCIM provides monitoring as well as predictive analytic capability to prevent such a disaster.
A new breed of Data Center Infrastructure Management (DCIM) software is now emerging out of the shadows of being just a monitoring and tracking tool. Advanced DCIM is now providing the much-needed business analytics for data centers. This can be a boon for both C-level executives and data center managers looking to cut costs while meeting demands for High Availability. This is a two-part blog. In this first part, we explore why analytics has become critical for data center operations in the new world order of the Internet of Things.
As data centers are growing in complexity, the need to keep them functioning at an optimum level, while cutting down on costs, is a challenge facing both the CIO as well as the CFO. Large businesses are spending millions to keep their data centers up and running and it is directly affecting their bottom line and ROI. Companies can no longer afford to let their data centers run under-utilized, nor can they afford failures. Sadly, most organizations are struggling to make the most of their data center investments.
Business Analytics and DCIM – An Introduction
Business Analytics software provides a broad set of capabilities for gathering and processing business data, and includes functions such as reporting, analysis, modeling and forecasting - all of which give business users the ability to make informed decisions and initiate actions directly from their dashboards.
In order to understand how a few advanced Data Center Infrastructure Management (DCIM) products provide similar capabilities for data centers, we first have to look at the challenge of running data centers effectively and at minimal cost. While the foremost responsibility of the Data Center Manager is maintaining High Availability, the challenge, somewhat ironically, can be summed up in one sentence:
Extreme redundancies with lots of assets increase the vulnerable points!
Not to mention, they also consume large amounts of resources, and typically remain under-utilized.
Data center assets comprise both physical as well as IT infrastructure. The resources to keep them running include space and networks and also power and cooling without which the assets would not be able to function.
Advanced DCIM gives data center operators the ability to manage all their data center assets and resources from a single dashboard. Through real-time monitoring of all assets and resources, they can determine correlations between different parameters, making their DCIM a powerful platform for deep analytics and business intelligence. DCIM analytics ensures that all data center assets are in good health while consuming the least amount of resources, and provides complete visibility into the power chain, enabling operators to track and eliminate potential points of failure.
In the second part, we will explore "How DCIM Business Analytics Works."
Not unlike Network Management Systems (NMS), Data Center Infrastructure Management (DCIM) software monitors a diverse set of equipment, ranging from servers and network switches to Power Distribution Units (PDUs), panels, sensors and Diesel Generator (DG) sets. These devices speak different protocols – MODBUS, SNMP and BACnet – and the parameters monitored also differ. For example, the monitored parameters for a DG set may be output voltage, output power and output current for all phases, while for a sensor they may be temperature and relative humidity. The software needs to capture data from the various devices, keep it in a persistent store, and report or alert on it. This poses a problem if we want to store it in the traditional row/column format of a relational database. We will explore the implementation options and the method we adopted.
Implementation Options in RDBMS
If we choose to store the monitored data in traditional relational form, we have a couple of options:
Build a superset of the column lists from all the monitored devices
With this option, let’s say we have three devices A, B and C; for A the monitored parameters are x and y, for B they are y and z, and for C they are x and z. A table with columns x, y and z should then suffice. Well, in the real world the number of device types can run into the hundreds, with each device having multiple unique parameters. In that case the number of columns will easily run into a few hundred, making the table design unwieldy. Furthermore, when populated with data, the table will be sparse. And every time a new device with a unique parameter is added, one will have to add columns to the table, making the design untenable.
Have a table per device
This approach is somewhat better than the previous one: in the design, add a table unique to each type of device. For example, there will be a table for DG sets with columns for the parameters monitored on a DG set, a table for sensors with temperature and relative humidity as columns, and so on and so forth. It sounds logical. However, this design suffers from similar deficiencies. Say you have two DG sets from two different manufacturers whose monitored parameters, although overlapping, are not exactly the same. What do we do – add two different tables for two DG sets? There goes the design principle for a toss!
How to retrofit this into an RDBMS-based solution?
Having described the issues we encountered, how do we design the persistence of monitored data? The natural choice would have been a NoSQL database such as Cassandra or a similar persistent store. The NoSQL data model is a dynamic-schema, column-oriented data model. This means that, unlike a relational database, you do not need to model all of the columns required by your application up front, as each row is not required to have the same set of columns. Columns and their metadata can be added by your application as needed, without incurring application downtime.

Since we had to retrofit the design into an already existing relational schema, we chose to have a single text column (a varchar field in RDBMS terminology) sufficiently large to hold the monitored data. However, we devised a scheme such that when we acquire data, we record what field it is, what its unit is and what its value is. For example, if from a sensor we acquire temperature and relative humidity, the data written into the table will be “field = temp, unit = Celsius, value = 22/field = RH, unit = %, value = 50”. Similarly, for a generator a data row may be “field = voltage, unit = volt, value = 240/field = power, unit = KW, value = 100”. Both these data points go into the same column, with another column holding the unique device id. Having done this, we simplified the design, its maintenance and reporting. A separate reporting module, which normalizes the data after suitably extracting it from the monitoring table, suffices for all kinds of reporting from each unique device. The scheme is flexible enough to add new devices with their own unique parameters without changing the core tables. This is how we married structured and unstructured data.
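The encoding described above can be sketched in a few lines. The helper names are ours, and a production version would need escaping for delimiter characters appearing inside values:

```python
def encode(readings):
    """Serialize (field, unit, value) triplets into the single-column format."""
    return "/".join(f"field = {f}, unit = {u}, value = {v}"
                    for f, u, v in readings)

def decode(blob):
    """Parse the stored string back into (field, unit, value) rows."""
    rows = []
    for part in blob.split("/"):
        entry = dict(kv.split(" = ") for kv in part.split(", "))
        rows.append((entry["field"], entry["unit"], entry["value"]))
    return rows

blob = encode([("temp", "Celsius", 22), ("RH", "%", 50)])
print(blob)          # the exact sensor row shown in the text above
print(decode(blob))  # the reporting module's normalized view
```

The `decode` step is what the reporting module would perform before normalizing readings into per-device report tables.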
As we approach the end of another year, it is time to reflect on what is in store for us in the data center world in 2015 and beyond.
Up until now, from the dot-com boom days, we witnessed first the proliferation and later the consolidation of data centers.
Now that is about to change. Gartner recently pointed out that a new disruption is about to happen in the data center world. Growth of the Internet of Things (IoT) will churn out hitherto unseen volumes of data every minute from zillions of Internet-enabled wearables, devices and industrial equipment around the world. This data will have to be analyzed instantaneously for operational intelligence, predictive and prescriptive analytics, and even national security. This will have a major impact on data center operations.
The current paradigm of centralized operations means transmitting data over thousands of miles from the point of origination to central servers, and then sending it back to the originating point and elsewhere, again over thousands of miles, a few times over. This would require massive bandwidth and massive computing power. Instead, we will see the emergence of a “glocal” model, with distributed data centers serving immediate local needs and then transmitting raw as well as processed data to centralized processing centers for deeper analysis and global consumption.
When this happens, it will bring about an overhaul of business processes and have a transformational effect on data center operations. Data center operations will look like an intricate supply chain network, involving business partners, transfer pricing, and transportation and cost optimization models. Data center operations will be software-driven, even transcending what is currently defined within the boundaries of a Software Defined Data Center (SDDC), which is primarily about the virtualization of computing resources. Data Center Infrastructure Management (DCIM) will be at the heart of this transformation, with a workflow-driven Business Process Management (BPM) layer that auto-enables data traffic via a set of algorithms working on constraint-based planning.
The use cases of the Internet of Things – from consumer efficiency to health care to transportation and even agriculture – are fascinating. What is less written about is the profound impact this will have at the back end, on the data centers that will have to support this new era of a ubiquitously connected world.
An often-asked question: Is DCIM overkill for data centers with less than 1,000 square feet of white space, or with connected power below 500 kW?
This question comes up because of a commonly held myth that DCIM is only about power savings. The argument is that the monetary savings from reducing power consumption in a small data center would be minimal. While I would not dispute that, it may interest the CFO that the annual power cost of running such a data center is approximately US$ 330,000, assuming a PUE of 2.5 (typical of a data center that has not undertaken any energy efficiency measures) and a power tariff of $0.13 per kWh. Reducing the PUE to 2.0 would bring the annual power bill down to about US$ 260,000. While a small saving, I am not sure $70,000 is totally irrelevant.
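The arithmetic behind those figures can be reconstructed under the stated assumptions (continuous draw, $0.13/kWh tariff); the exact ratio gives roughly $264,000 and $66,000 in savings, which the figures above round to $260,000 and $70,000:

```python
TARIFF = 0.13   # $ per kWh
HOURS = 8760    # hours in a year

annual_cost_at_pue_25 = 330_000.0
# back out the continuous IT load implied by the $330,000 bill at PUE 2.5
it_kw = annual_cost_at_pue_25 / (2.5 * HOURS * TARIFF)   # ~116 kW
annual_cost_at_pue_20 = it_kw * 2.0 * HOURS * TARIFF
savings = annual_cost_at_pue_25 - annual_cost_at_pue_20
print(f"IT load ~{it_kw:.0f} kW, bill at PUE 2.0 ~${annual_cost_at_pue_20:,.0f},"
      f" savings ~${savings:,.0f}")
```

Since cost scales linearly with PUE at a fixed IT load, the new bill is simply the old one times 2.0/2.5.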
But DCIM tools are not just about cutting power costs, although that is indeed an important reason for larger data centers, and more so for multi-tenant data centers where over 40% of operating costs are power-related. In my mind, the fundamental reason why even a small data center should invest in DCIM software is asset management, specifically asset relationship mapping. Mapping the entire chain – from the application and its business owner to the Virtual Machine on which it resides, and all the way up the power and network chains to the source of power (in small data centers, mostly up to the UPS) and the network routers and switches – would deliver far greater financial savings, a topic that could be a blog by itself.
DCIM asset management’s relevance to a smaller data center is similar to the relevance of ERP to a smaller discrete manufacturing unit, which may still have multiple levels of Bill of Material (BOM); getting the BOM levels and their inter-relationships right is fundamental to the lean manufacturing that brings about huge cost savings. It is also similar to the relevance of CRM to a smaller bank, where customers may still have multiple banking relationships that the CRM tracks, enabling the bank to provide better service and garner a higher wallet share of the customer. Complete asset relationship mapping in a 1,000 square foot data center, which may have as many as 500 inter-related devices, helps avoid over-provisioning (read: wasted capital expenditure) and delivers better availability through visibility of the cascading impacts of a device failure in the chain (read: avoiding costly failures).
Just as ERP and CRM are universally adopted by SMB enterprises for higher profitability, so should DCIM find a place in smaller data centers for lowering capital costs and mitigating the risks of a data center failure.
This was my first time at AFCOM’s Data Center World. Held in Las Vegas between 28th April and 2nd May 2014, it was refreshingly different from other conferences in that sponsor speakers could not advertise their wares during their speaking sessions. The sessions were aptly called “educational,” as they provided an opportunity to hear vendor-neutral takes on the latest technologies. Of course, the hint was not lost when delivered by a vendor, but one was spared brazen advertisements. There were some outstanding educational sessions, including panel discussions on risk assessments, big data and, of course, DCIM. My session was on “Digital Services Efficiency: A New Management Scorecard.” The theme was how next-gen Data Center Infrastructure Management (DCIM) software would have role-based KPIs based on individual key result areas. The full version of the presentation is available here.
The exhibition floor displayed the latest in power, cooling, security and fire suppression systems. That leads me to what I really want to discuss in today's blog.