An Overview of Design and Optimization of Data Center Power and Environmental Monitoring System

Anke Electrical Co., Ltd.

Publish Time: 2024-07-08

Summary: For a bank data center construction project, we designed a power and environment monitoring system for the data center's server rooms. We analyzed the objects to be monitored, established the monitoring architecture, and described how monitoring is implemented. Finally, we proposed optimization measures and suggestions for the problems that emerged after a period of operation. The work may serve as a reference for similar construction projects.

Keywords: Data Center; Power and Environment Monitoring; System Architecture; Network Topology

Introduction

Bank data center server rooms contain a large amount of infrastructure, such as power supply and distribution equipment and precision air conditioning, but few staff are available to maintain it, which increases the intensity and difficulty of operation and maintenance work. To detect and handle equipment failures promptly, this article designs a power and environment monitoring system and proposes optimization measures for the main problems encountered during operation.

1 System Monitoring Object

The monitoring objects of the Power Environment Monitoring System (hereinafter referred to as the PEMS) for the bank's data center fall into three main categories: 1) the power system, with real-time monitoring of operating status such as the switch states of the power supply and distribution systems, UPS units, and diesel generators, together with their operating parameters and conditions; 2) the operating environment within the server room, including temperature, humidity, water leakage, hydrogen concentration, and fire protection; 3) the entry and exit of personnel and equipment, covering access control systems, cameras, and anti-intrusion devices. The operating status of security- or network-related hardware inside the cabinets, such as servers, switches, and encryption machines, is outside the monitoring scope and is not discussed in this article.

2 System Architecture Design

2.1 Design Principles

The design of the bank's data center power and environment monitoring system should follow a "centralized, integrated, and intelligent" model and apply high-standard design principles to achieve proactive, efficient, and process-oriented management.

(1) Stability: As the "butler" of the data center infrastructure, the power and environment monitoring system must provide uninterrupted service 24 hours a day. This depends on both a stable power supply for the monitoring equipment and reliable network communication.

(2) Safety: The signal collection loop of the monitoring system should have robust protection mechanisms so that a fault in the loop cannot cause the monitored equipment to malfunction or fail. The system should also have self-check functions and, when an infrastructure fault occurs, promptly notify maintenance staff by phone call or text message with the location and nature of the fault.

(3) Openness: The power and environment monitoring system should follow open design standards, providing multiple external interfaces and supporting standard communication protocols such as Modbus TCP, OPC, ODBC, and BACnet, to facilitate data transmission and exchange with third-party manufacturers' equipment.

(4) Scalability: The monitoring system should be expandable and easy to maintain so that it can accommodate changes such as the expansion of the server rooms and the addition of monitoring equipment.
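
Principle (3) names Modbus TCP among the open protocols. As a rough illustration of what an open, documented protocol makes possible, the sketch below builds a Modbus TCP "read holding registers" request with only the Python standard library. The unit ID, register address, and register count are hypothetical values chosen for illustration, not taken from this project.

```python
import struct


def build_read_holding_registers(transaction_id: int, unit_id: int,
                                 start_address: int, quantity: int) -> bytes:
    """Build a Modbus TCP request frame for function code 0x03."""
    if not 1 <= quantity <= 125:
        raise ValueError("quantity must be between 1 and 125 registers")
    # PDU: function code (1 byte), start address (2 bytes), quantity (2 bytes)
    pdu = struct.pack(">BHH", 0x03, start_address, quantity)
    # MBAP header: transaction id, protocol id (always 0), length, unit id.
    # The length field counts the unit id byte plus the PDU.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


# Hypothetical request: read 2 registers starting at address 0 from unit 1.
frame = build_read_holding_registers(transaction_id=1, unit_id=1,
                                     start_address=0, quantity=2)
print(frame.hex())
```

Because the frame layout is public, any vendor's collector can issue the same request, which is the practical benefit the openness principle is after.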

2.2 System Architecture

The power and environment monitoring system uses computer networks, modern communication technologies, and control techniques to monitor the power equipment and environment of the server rooms in real time, enabling unattended, modernized management. In hardware, it employs a three-tier architecture: the bottom layer consists of field equipment, including the monitored devices and I/O collection modules; the middle layer is the data collection and processing layer, comprising serial port servers, monitoring servers, switches, and similar equipment; and the top layer is the data application layer, made up of monitoring platforms or client terminals. In software, it uses a browser/server (B/S) structure and collects bottom-layer data through the sensors and data collection devices installed in the server room. It integrates all subsystems under a unified user interface, allowing them to be monitored, controlled, and coordinated together as a cohesive whole.

3 System Implementation

3.1 Project Overview

The data center's server rooms are located on the 6th floor, which is subdivided into Server Rooms 1, 2, and 3, the Network Room, and Power Distribution Rooms A and B. Key circuit breakers or switches, power meters, UPS systems, and lightning protection are located in the power distribution rooms. In the server rooms, the fresh air systems, precision air conditioners, leak detection, cabinet PDUs, temperature and humidity monitoring, and anti-intrusion systems (infrared detection) are included in the monitoring system. The UPS battery room is on basement level 2, the diesel generator room on basement level 1, the triple power switching room on the 1st floor, the operations room on the 7th floor, and the fire extinguisher room on the 8th floor. The monitoring objects of the power and environment monitoring system are listed in Table 1.

3.2 Hardware Composition

The power and environment monitoring system consists of 2 servers (hot standby), 2 client PCs, a large monitoring display, core switches (networks A and B), video aggregation switches, access control switches, collection boxes, serial port servers, and related equipment.

3.2.1 Core Data Collection Equipment

The collection box collects raw data such as switch states, temperature, and humidity, and serves as the core of the entire monitoring system. It uses a Shenzhen Jitong rack-mounted unit, 2U in height, that can be installed inside a cabinet. The collection modules inside the box are connected to the monitored equipment via terminal strips. The serial port server is the Jitong OA-O9000E embedded intelligent management unit, which integrates data collection, parsing, storage, and alarming, and can locate faults accurately. It accepts data signals from various manufacturers' equipment and converts ("translates") their protocols.

3.2.2 On-site Equipment Layer

Equipment at the field device level is divided into four categories: equipment requiring protocol converters, equipment requiring communication protocols, analog direct collection modules, and digital direct collection modules.

(1) Equipment requiring protocol converters (serial port servers) includes precision air conditioners, leak detection ropes, cabinet PDUs, power meters, UPS power supplies, batteries, and diesel generators. The manufacturers of these devices must provide communication interfaces and open communication protocols so that the operating parameters or status of the equipment can be monitored.

(2) Equipment requiring communication protocols includes the video surveillance and access control subsystems. Their manufacturers must provide the communication protocols, and the power and environment monitoring system then integrates and manages the subsystems, so that an operator can view live video from any camera by clicking it in the monitoring interface and can open or close any door.

(3) Analog direct collection module. ① Temperature and humidity monitoring: Temperature and humidity sensors installed in key areas, in cold and hot aisles, and inside cabinets in the server room collect real-time data on temperature and humidity changes and on thermal distribution. ② Hydrogen monitoring: A hydrogen collection module installed in the battery room detects in real time whether the hydrogen concentration (ppm) exceeds the standard, allowing early detection of hydrogen hazards caused by battery leaks; an alarm is triggered when the hydrogen concentration reaches the set threshold.

(4) Digital (switch-state) direct collection module. ① Critical switch monitoring: The auxiliary contact status of important circuit breakers in the distribution cabinet is monitored to determine whether each switch is on or off; when the monitored state differs from the configured default state, the main monitoring system issues an alarm. ② Lightning protection monitoring: The remote signal contacts of the lightning protectors are monitored in real time; when the monitored state differs from the configured default state, the main monitoring system issues an alarm. ③ Fresh air and exhaust monitoring: Differential pressure switches installed in the fresh air and exhaust ducts detect pressure differential signals, from which the operating status of the fresh air and exhaust fans in the server room is monitored; the fresh air fan can also be started and stopped remotely. ④ Intrusion prevention monitoring: Infrared sensors installed in the server room monitor personnel movement, and the system issues an alarm when a sensor reports an abnormal state. ⑤ Fire protection monitoring: The alarm output signals of the fire control master unit are collected to monitor the fire protection status of each zone in the server room in real time; on an alarm, the system automatically switches to the corresponding monitoring interface, the fire alarm icon turns red and blinks, and an alarm event is generated, recorded, and stored.
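
The alarm rules in items (3) and (4) reduce to two checks: an analog point alarms when its reading reaches a threshold, and a digital point alarms when its state differs from the configured default. The following is a minimal sketch of those two rules. The class names, point names, and the 1000 ppm threshold are hypothetical; the source does not give the actual data model or limits.

```python
from dataclasses import dataclass


@dataclass
class DigitalPoint:
    name: str
    default_state: bool  # expected contact state in normal operation


@dataclass
class AnalogPoint:
    name: str
    high_limit: float  # alarm when the reading reaches this value


def digital_alarm(point: DigitalPoint, current_state: bool) -> bool:
    """Alarm when the observed contact state differs from the default."""
    return current_state != point.default_state


def analog_alarm(point: AnalogPoint, reading: float) -> bool:
    """Alarm when the reading reaches or exceeds the configured limit."""
    return reading >= point.high_limit


# Hypothetical points: a breaker that is normally closed, and a hydrogen
# sensor with an assumed 1000 ppm alarm threshold.
breaker = DigitalPoint("PDR-A main breaker", default_state=True)
hydrogen = AnalogPoint("battery room H2 (ppm)", high_limit=1000.0)

print(digital_alarm(breaker, current_state=False))  # breaker opened
print(analog_alarm(hydrogen, reading=350.0))        # below threshold
```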

3.2.3 Power Supply and Network Architecture of the Monitoring System

The hardware requires a dual UPS power supply to ensure reliable power and meet the requirement of 24-hour continuous service. In addition, critical hardware requires a master-slave configuration; the monitoring servers, for example, are set up as a dual-machine hot standby. The "dual monitoring system + dual database" model keeps the system running without interruption.

The hardware of the power and environment monitoring system runs on dual networks with hierarchical aggregation. The network devices consist of PoE switches, access switches, aggregation switches, and core switches. The PoE switches power the video cameras and carry their data; the access switches are Layer 2 switches with VLAN functionality that collect data from the collection units; the aggregation switches are Layer 3 switches that aggregate the PoE switch data, which prevents an oversized Layer 2 network from forming loops and also reduces the data load on the core switches. The network topology of the power and environment monitoring system is shown in Figure 2.

Figure 2: Network Topology Diagram
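
The source does not describe how the hot-standby pair decides which server is active. One common approach is a heartbeat with a timeout, sketched below purely as an assumption: the server names, the 5-second timeout, and the `select_active` function are all hypothetical.

```python
from typing import Dict, List, Optional


def select_active(now: float, last_heartbeat: Dict[str, float],
                  priority: List[str], timeout_s: float = 5.0) -> Optional[str]:
    """Return the highest-priority server whose heartbeat is still fresh.

    Returns None when no server has reported within timeout_s seconds.
    """
    for name in priority:
        seen = last_heartbeat.get(name)
        if seen is not None and now - seen <= timeout_s:
            return name
    return None


# Hypothetical timestamps in seconds: the primary stopped reporting at t=90,
# so at t=100 the standby takes over.
order = ["primary", "standby"]
print(select_active(100.0, {"primary": 98.0, "standby": 99.0}, order))
print(select_active(100.0, {"primary": 90.0, "standby": 99.0}, order))
```

With a dual database, the standby would also need an up-to-date copy of the data before taking over; that synchronization is outside this sketch.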

3.3 Software Platform

The central power and environment monitoring platform software adopts a B/S architecture and collects underlying data through the sensors and data collection devices installed in the server room. Third-party equipment must provide communication interfaces and open communication protocols so that its data can be "translated". Everything is monitored centrally through the server room monitoring platform, which offers a fully Chinese-language graphical interface with a clear structure and a real-time view of data status. The platform runs on the Chinese-language edition of Windows.

The software is modular and is divided into collection, processing, management, and display layers, as shown in Figure 3. The personal work platform offers customizable interfaces, including the main monitoring interface, alarm event list, pending tasks, alarm level statistics, real-time PUE curve, and infrastructure category pie charts. Report management can generate detailed data records and analysis reports based on the existing server room report formats and store them as Excel or PDF files; data must be retained for more than one year and be protected against tampering.

In the interactive interface, the power and environment monitoring module gives a direct view of the real-time operating status of each server room. Server room names and equipment icons can be set as hyperlinks to the corresponding sub-interfaces. Buttons such as temperature and humidity monitoring, access control, video surveillance, temperature field, leakage detection, infrared monitoring, and fire monitoring open the individual screens directly. The platform also offers data display formats such as electronic maps, real-time curves, pie charts, line charts, and histograms, which help maintenance personnel analyze historical operating trends and judge equipment condition.

Figure 3: Power and Environmental Central Monitoring Platform Software Architecture Diagram

The system combines three alert methods: text messages, phone calls, and on-site voice alerts. Alerts are divided into three levels (emergency, important, and general), and each level uses a different alert method. Alerts are prompted and displayed automatically regardless of the screen currently open. When an alert is resolved, the system can automatically send a corresponding recovery text message, so that facility managers stay informed at all times.
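
The level-to-method routing described above can be sketched as a simple lookup. The source states only that each level uses a distinct method, so the specific mapping below is an assumption made for illustration, as are the channel names and message format.

```python
from enum import Enum
from typing import List, Tuple


class Level(Enum):
    EMERGENCY = "emergency"
    IMPORTANT = "important"
    GENERAL = "general"


# Assumed mapping: the source does not say which methods each level uses.
CHANNELS = {
    Level.EMERGENCY: ["phone_call", "sms", "onsite_voice"],
    Level.IMPORTANT: ["sms", "onsite_voice"],
    Level.GENERAL: ["onsite_voice"],
}


def notifications(level: Level, device: str, message: str,
                  recovered: bool = False) -> List[Tuple[str, str]]:
    """Return (channel, text) pairs to send for an alert or its recovery."""
    if recovered:
        # The source describes recovery notices as text messages.
        return [("sms", f"[RECOVERED] {device}: {message}")]
    tag = level.value.upper()
    return [(ch, f"[{tag}] {device}: {message}") for ch in CHANNELS[level]]


for channel, text in notifications(Level.IMPORTANT, "UPS-A", "on battery"):
    print(channel, text)
```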

4 Issues and Optimization Measures During System Operation

4.1 Common Problems

Since the power and environment monitoring system went into operation, problems have included data failing to refresh on the monitoring platform, the computer becoming unresponsive, inaccurate data collection, and alarm issues: missed alarms, false alarms, frequent alarms (alarm signal fluctuation), and delayed notifications.

(1) The monitoring platform data is not refreshing. This situation is quite common in actual operation and maintenance work, where the entire monitoring platform software data or data from a specific device within the system fails to refresh, resulting in operation and maintenance personnel not receiving alarm notifications.

(2) Inaccurate data collection. This shows up when the data displayed on the monitoring screen does not match the actual operating data of the on-site equipment. If the displayed data crosses the alarm threshold, false alarms or missed alarms may result, affecting the safety of equipment operation. For instance, if the values measured by a smart instrument differ from those shown by the system, or use different units, the device is effectively no longer monitored. In cold and hot aisle temperature monitoring, the software raises an alarm when the displayed value exceeds the upper alarm threshold, yet on-site inspection by operations personnel finds that the actual value is within limits, which wastes staff time.

(3) Real-time alarm issues. Missed, false, frequent, and delayed alarms, together with alarm signal fluctuation, are serious problems that trouble data center operations staff. Operations staff are on emergency duty 24 hours a day, 7 days a week, and false or frequent alarms take a significant physical toll on them, while missed or delayed alarms mean that equipment malfunctions are not reported in time, which can lead to more severe data center accidents.

① Missed alarms: The main causes are an alarm level that is set too low, a communication interruption with the equipment, or a fault in data collection from the equipment. Critical alarm information is then lost and is not reported to maintenance personnel in time, which can have severe consequences.

② False alarms: The false alarm rate is a critical indicator of the usability of the power and environment monitoring system. False alarms have many causes, including electromagnetic interference or changes in the surrounding environment, incorrect protocol parsing, faults in the collection devices, instrument malfunctions, and problems with I/O board ports. For instance, dust or sand on a leak detection rope around a precision air conditioner can change its resistance and trigger a false alarm.

③ Frequent alarms: Frequent alarms amount to "information overload" and occur in two scenarios. In the first, the same alarm message is sent to operations personnel repeatedly because the collected value at a monitoring point fluctuates around the alarm threshold. In the second, a single event causes many devices to alarm at once: during a power outage, or a momentary power loss followed by recovery, important switches, power meters, UPS systems, and cabinet PDUs all raise alarms, producing a barrage of phone calls and text messages.

④ Alarm delay: How quickly alarm information reaches operations personnel is a critical indicator of the effectiveness of a monitoring system. The reporting delay should be a user-configurable setting. For instance, a certain delay can be set for cases such as a momentary power loss that recovers immediately. Important information, however, should be reported within 15 seconds.
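
The configurable reporting delay in item ④ can be sketched as a small state machine that reports an alarm only after the condition has persisted for a set time, so that a momentary power loss does not generate a message. This is a minimal illustration and not the system's actual logic; the class name and the 3-second delay are hypothetical, and any real delay for important alarms would need to stay well under the 15-second limit stated above.

```python
class DelayedAlarm:
    """Report an alarm once the condition has persisted for delay_s seconds."""

    def __init__(self, delay_s: float):
        self.delay_s = delay_s
        self._since = None      # time at which the condition became active
        self._reported = False  # whether this occurrence was already reported

    def update(self, active: bool, now: float) -> bool:
        """Return True exactly once per occurrence, when it should be reported."""
        if not active:
            # Condition cleared: reset so the next occurrence starts fresh.
            self._since = None
            self._reported = False
            return False
        if self._since is None:
            self._since = now
        if not self._reported and now - self._since >= self.delay_s:
            self._reported = True
            return True
        return False


# Hypothetical trace: a 1-second dip at t=0 is ignored, while an outage
# starting at t=2 is reported at t=5.
alarm = DelayedAlarm(delay_s=3.0)
trace = [(True, 0.0), (False, 1.0), (True, 2.0), (True, 5.0), (True, 6.0)]
print([alarm.update(active, t) for active, t in trace])
```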

4.2 Optimization Measures

(1) Resolving the data refresh problem. Operations personnel must be familiar with the architecture and network topology of the power and environment monitoring system so that they can troubleshoot step by step, from single-point device failures to network problems. Where necessary, optimize the system structure or network topology for critical equipment, provide redundant backups for data collection devices or systems, or use dual-network (A and B) communication for critical monitoring objects.

(2) Resolving data collection accuracy problems. Verify the communication protocols of smart instruments and third-party equipment, and confirm the protocol documentation with the original manufacturer's technical support. If the protocol is not verified when a smart instrument is replaced, data may be inaccurate or may not be collected at all. Check for communication faults, starting with the physical connections and then the communication configuration, including baud rate, parity, and serial port settings. Inspect the collection devices, cabinets, and temperature and humidity sensors for hardware malfunctions, to rule out hardware as the cause of inaccurate data.
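
A scale or unit mismatch of the kind described in measure (2) often comes down to how a raw register value is converted to engineering units. The sketch below assumes a 16-bit register with an integer divisor, which is a common convention but is not stated in the source; the function name and the sample values are hypothetical.

```python
def decode_register(raw: int, divisor: int, signed: bool = False) -> float:
    """Convert a raw 16-bit register value to engineering units."""
    if not 0 <= raw <= 0xFFFF:
        raise ValueError("raw value must fit in 16 bits")
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    if signed and raw >= 0x8000:
        raw -= 0x10000  # interpret as two's-complement
    return raw / divisor


# The same raw value read with the documented divisor and with a wrong one:
# a tenfold error like this is what a protocol mismatch looks like on screen.
print(decode_register(235, divisor=10))    # documented scale
print(decode_register(235, divisor=100))   # wrong scale after a device swap
print(decode_register(0xFFF6, divisor=10, signed=True))
```

Comparing a decoded value against the instrument's own display is a quick way to confirm that the divisor and signedness in the configuration match the manufacturer's protocol document.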

(3) Optimizing real-time alarm problems. First, strictly limit the number of connected intelligent devices, because too many connections slow data uploads and in turn delay alarms. Configure the FSU scanning time sensibly: adjust the device scanning cycle to shorten the time the data collection devices take to poll each measurement point, which increases collection speed. Second, select measurement points sensibly; scanning too many points overloads the collectors and reduces collection efficiency, and operations personnel should avoid scanning non-critical data that consumes resources and slows collection. Third, control frequent and unreasonable alarms in software by adding an alarm hysteresis (suppression) function. For collected values that fall outside a reasonable range, set upper and lower validity limits to shield that data and eliminate false alarms. For false alarms caused by electromagnetic interference during transmission, the software-side validity limits can be complemented by anti-interference magnetic rings on the transmission lines. Finally, advanced methods such as artificial intelligence can be used to strengthen the analysis of logical relationships between alarms and to classify alarm information sensibly. This includes adding alarm tracing, distinguishing primary from secondary alarms, and determining the master-slave relationship of the devices that generate alarms, which reduces "alarm information overload" while ensuring that no critical alarm is missed.

5 Ankorri Environmental Monitoring System Solution
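
Two of the software measures in (3), hysteresis and validity limits, can be sketched together. The alarm is raised at one threshold and cleared only at a lower one, so a value hovering around the limit does not retrigger, and readings outside a plausible range are ignored instead of being evaluated. The thresholds and the class name below are hypothetical, chosen only to illustrate the technique.

```python
class HysteresisAlarm:
    """High-limit alarm with a separate clear level and a validity range."""

    def __init__(self, high: float, clear: float,
                 valid_min: float, valid_max: float):
        if clear >= high:
            raise ValueError("clear level must be below the alarm level")
        self.high = high
        self.clear = clear
        self.valid_min = valid_min
        self.valid_max = valid_max
        self.active = False

    def update(self, value: float) -> bool:
        """Feed one reading and return the alarm state afterwards."""
        # Shield implausible readings (for example, interference spikes):
        # the previous state is kept unchanged.
        if not self.valid_min <= value <= self.valid_max:
            return self.active
        if not self.active and value >= self.high:
            self.active = True
        elif self.active and value <= self.clear:
            self.active = False
        return self.active


# Hypothetical cold-aisle limits: alarm at 27, clear at 25, valid -20..60.
alarm = HysteresisAlarm(high=27.0, clear=25.0, valid_min=-20.0, valid_max=60.0)
readings = [26.9, 27.1, 26.8, 27.0, 999.0, 24.9]
print([alarm.update(r) for r in readings])
```

Without the clear level, the readings 27.1, 26.8, and 27.0 would raise the alarm twice; with it, the alarm is raised once and stays active until the temperature drops to 25.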

The data center's power and environment monitoring system provides real-time monitoring of access control, water leakage, smoke, video, environmental conditions, high- and low-voltage power distribution, and equipment operating status. It also raises real-time alerts to keep the data center operating normally and to prevent equipment failures caused by an uncontrolled operating environment. It helps protect maintenance personnel, extends equipment life, and reduces the costs of inefficient management of power distribution rooms. In addition, the system combines power and environment monitoring with energy consumption analysis to help users optimize energy efficiency.

5.1 System Function

(1) Display the total energy consumption of the current data center, IT energy consumption, air conditioning energy consumption, and other energy consumptions, and calculate the real-time PUE value of the data center. Present these data intuitively through a dashboard.

(2) View the main wiring diagram of the medium- and low-voltage distribution system in the data center, showing the telemetry data, remote-signaling data, and status of the distribution system on a single diagram. Monitor in real time the power parameters of each distribution cabinet, such as voltage and current, as well as environmental conditions in the substation, such as temperature and humidity, smoke, water immersion, and access control.

(3) Monitor the temperature of electrical contacts in real time, using wireless temperature sensors installed at locations such as circuit breaker contacts, contact arms, busbars, and cable connections to measure junction temperatures. This allows early detection of temperature anomalies that may lead to accidents.

(4) Monitor transformer parameters, including load factor, frequency, power factor, and three-phase unbalance, and display historical curves alongside real-time data changes, so that users can see the transformer's operating status directly.

(5) Monitor power quality online, detecting current and voltage harmonic distortion rates and voltage swells, sags, and interruptions, recording transient events, and plotting ITIC tolerance curves.

(6) The system collects three-phase voltage, current, active power, power factor, and frequency at the input, output, and bypass of the UPS, while also monitoring the UPS temperature, battery voltage, remaining runtime under the current load, and other data.

(7) Display individual battery voltage, internal resistance, and temperature, predict remaining time under load, and set abnormal alarm for each battery's data to promptly detect anomalies in the lead-acid battery.

(8) Display electrical parameters of the incoming and outgoing lines in the precision distribution cabinet, including current, voltage, power, energy, and switch status. Data can be set for alarm and classified, sourced from the measurement module of the precision distribution cabinet.

(9) Display electrical parameters of the initial box and junction box of the intelligent mini busbar, including current, voltage, switch status, and junction point temperature, and set up alarms and alarm levels for the data.

(10) Visualize data center energy distribution and equipment layout through a floor plan, displaying energy consumption data for each device. Clicking on a device on the floor plan leads to a specific equipment monitoring interface.

(11) Display the current PUE value of the data center in real time together with the historical PUE curve, and show the energy consumption and ranking of each sub-item. Monitor the operation and load of the transformers and rank them by electricity output for the current month.

(12) Display daily, monthly, and yearly reports of electrical energy consumption, with line graphs or pie charts available for specific circuits. Perform year-on-year and month-on-month analysis of data center electricity usage to observe trends.

(13) Monitor the return air temperature and humidity and the return water temperature of the precision air conditioners, and adjust their temperature and humidity setpoints to achieve better control.

(14) Monitor parameters such as temperature and humidity, door status, water immersion, smoke, noise, and gas concentration in the data center. The graphical curves are intuitive and clear, and also support historical data queries.

(15) Display the number of various alarm events in a list, the daily alarm counts in a bar chart, and provide the total alarm count along with the growth trend.

The system also adds operation and maintenance management functions, supporting inspection, dispatching, defect elimination, and emergency repair for all major equipment in data centers.
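
The PUE value mentioned in functions (1) and (11) is the ratio of total data center energy to IT equipment energy. The sketch below computes it from sub-item consumption figures; the figures and dictionary keys are hypothetical and only show the arithmetic, not data from the project.

```python
from typing import Dict


def pue(total_kwh: float, it_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT energy."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return total_kwh / it_kwh


def sub_item_shares(items: Dict[str, float]) -> Dict[str, float]:
    """Fraction of total energy used by each sub-item."""
    total = sum(items.values())
    return {name: kwh / total for name, kwh in items.items()}


# Hypothetical consumption over one period, in kWh.
consumption = {"it": 1000.0, "air_conditioning": 400.0, "other": 100.0}
total = sum(consumption.values())

print(round(pue(total, consumption["it"]), 2))
for name, share in sub_item_shares(consumption).items():
    print(f"{name}: {share:.1%}")
```

A PUE of 1.0 would mean all energy goes to IT equipment; the further the value rises above 1.0, the larger the share consumed by cooling and other overhead.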

6 Conclusion

Power and environment monitoring systems play a crucial role in operating and maintaining the infrastructure of bank data centers, serving in effect as the "eyes, ears, and nose" of the operations team. Their usefulness depends on uninterrupted 24/7 service and on critical alerts reaching the relevant operations personnel in time. The stability, reliability, and correct operation of the system depend primarily on the design of the system architecture, power supply, and network. Problems will inevitably arise during operation, so experience must be summarized, problems identified, and the system optimized continuously. Based on the actual construction of a power and environment monitoring system in a bank data center, this article designed and implemented the system and optimized it for the problems encountered during operation, demonstrating the feasibility of the approach.

References

Wan Liyong. Design and Optimization of Power Environment Monitoring System for Data Center Machine Rooms [J]. Electrical Engineering Technology, 2022(15): 164-167.

Li Ke, Wang Jiajia. Design of Operation Management Platform for Data Center Infrastructure of Electric Power Enterprises [J]. Digital Technology and Application, 2021(39): 196-197.

Ankore Enterprise Microgrid Design & Application Manual, 2022.5 Edition
