Hewlett Packard Enterprise (HPE) recently announced a collaboration with the U.S. Department of Energy’s National Renewable Energy Laboratory (NREL) to develop Artificial Intelligence (AI) and Machine Learning (ML) technologies that automate data center operations and improve their efficiency, including resiliency and energy usage, for the exascale era. The effort supports NREL’s ongoing mission, as a world leader in advancing energy efficiency and renewable energy technologies, to create and implement new approaches that reduce energy consumption and lower operating costs.
The project is part of a three-year collaboration that introduces monitoring and predictive analytics to the power and cooling systems of the high-performance computing (HPC) data center in NREL’s Energy Systems Integration Facility (ESIF).
HPE and NREL are using more than five years of historical data (more than 16 TB in total), collected from sensors in NREL’s Peregrine and Eagle supercomputers and in the facility itself, to train anomaly-detection models that predict and help prevent issues before they occur.
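The release does not say which anomaly-detection models are being trained. To make the idea concrete, here is a minimal sketch of one simple stand-in technique, a rolling z-score detector, which flags a sensor reading that sits far from the mean of the readings just before it. The function name `rolling_zscore_anomalies`, the window size, the threshold and the synthetic temperature data are all hypothetical illustrations. None of them is taken from the HPE/NREL work.

```python
import numpy as np

def rolling_zscore_anomalies(readings, window=60, threshold=4.0):
    """Flag readings that deviate strongly from the preceding `window` samples.

    Hypothetical helper for illustration only; not the HPE/NREL model.
    Returns a boolean array that is True where a reading is anomalous.
    The first `window` samples are never flagged (no history yet).
    """
    x = np.asarray(readings, dtype=float)
    flags = np.zeros(x.shape, dtype=bool)
    for i in range(window, len(x)):
        history = x[i - window:i]          # trailing window, excludes the current sample
        mu = history.mean()
        sigma = history.std()
        if sigma == 0.0:
            flags[i] = x[i] != mu          # flat history: any change counts as a deviation
        else:
            flags[i] = abs(x[i] - mu) / sigma > threshold

    return flags

# Synthetic example (hypothetical data): inlet temperatures in °C with one injected fault.
rng = np.random.default_rng(0)
temps = 22.0 + rng.normal(0.0, 0.2, size=500)
temps[400] += 5.0
flagged = np.flatnonzero(rolling_zscore_anomalies(temps))
print(flagged)
```

A detector like this only recognizes events once they appear in the data. The predictive models described above aim to go further and anticipate events before they occur, which is typically done by training on labeled historical incidents.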
The collaboration will also address the future water and energy consumption of data centers, which in the United States alone is projected to reach approximately 73 billion kWh and 174 billion gallons of water by 2020. HPE and NREL will monitor energy usage to optimize efficiency and sustainability, as measured by key metrics such as Power Usage Effectiveness (PUE), Water Usage Effectiveness (WUE) and Carbon Usage Effectiveness (CUE).
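All three metrics, as defined by The Green Grid, divide a facility-level quantity by the energy delivered to IT equipment. PUE is total facility energy divided by IT energy, a dimensionless value of 1.0 or more. WUE is annual site water use divided by IT energy, in liters per kWh. CUE is the CO2-equivalent emissions caused by total facility energy divided by IT energy, in kg CO2eq per kWh. The sketch below shows the arithmetic. The `AnnualFacilityData` class, the `efficiency_metrics` function and the example figures are hypothetical and are not NREL measurements.

```python
from dataclasses import dataclass

@dataclass
class AnnualFacilityData:
    """Hypothetical container for one year of facility-level totals."""
    total_facility_kwh: float   # all energy entering the data center (IT + cooling + losses)
    it_equipment_kwh: float     # energy delivered to IT equipment
    site_water_liters: float    # water used on site, e.g. for cooling
    emissions_kg_co2eq: float   # CO2-equivalent emissions from the total facility energy

def efficiency_metrics(d: AnnualFacilityData) -> dict:
    """Compute PUE, WUE and CUE, each normalized by IT equipment energy."""
    if d.it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return {
        "PUE": d.total_facility_kwh / d.it_equipment_kwh,   # dimensionless, >= 1.0
        "WUE": d.site_water_liters / d.it_equipment_kwh,    # liters per IT kWh
        "CUE": d.emissions_kg_co2eq / d.it_equipment_kwh,   # kg CO2eq per IT kWh
    }

# Illustrative (made-up) annual totals.
example = AnnualFacilityData(
    total_facility_kwh=1_200_000,
    it_equipment_kwh=1_000_000,
    site_water_liters=500_000,
    emissions_kg_co2eq=480_000,
)
print(efficiency_metrics(example))
```

With these made-up totals the sketch gives a PUE of 1.2, a WUE of 0.5 L/kWh and a CUE of 0.48 kg CO2eq/kWh. Because every metric shares the same denominator, cutting cooling overhead lowers PUE without changing the IT load.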
Early results based on models trained with historical data have successfully predicted or identified events that previously occurred in NREL’s data center, demonstrating the promise of using predictive analytics in future data centers.
The AI Ops project sprang from HPE’s R&D work on PathForward, a program backed by the U.S. Department of Energy to accelerate the nation’s technology roadmap for exascale computing, the next major leap in supercomputing. HPE recognized a critical need to develop AI and automation capabilities to manage and optimize data center environments for the exascale era. Applying AI-driven operations to an exascale supercomputer, which will run a thousand times faster than today’s systems, will enable energy-efficient operations and increase resiliency and reliability through smart, automated capabilities.
“We are passionate about architecting new technologies that are impactful to powering the next era of innovation with exascale computing and its extent of operational needs,” said Mike Vildibill, vice president of the Advanced Technologies Group, HPE. “We believe our journey to develop and test AI Ops with NREL, one of our longstanding and innovative partners, will allow the industry to build and maintain smarter and more efficient supercomputing data centers as they continue to scale power and performance.”
“Our research collaboration will span the areas of data management, data analytics and AI/ML optimization for both manual and autonomous intervention in data center operations,” said Kristin Munch, manager of the Data, Analysis and Visualization Group, NREL. “We’re excited to join HPE in this multi-year, multi-staged effort and we hope to eventually build capabilities for an advanced smart facility after demonstrating these techniques in our existing data center.”
The project will use open-source software and libraries such as TensorFlow, NumPy and scikit-learn to develop machine learning algorithms. It will focus on the following key areas:
- Monitoring: Collect, process and analyze vast volumes of IT and facility telemetry from disparate sources before applying algorithms to data in real time
- Analytics: Big data analytics and machine learning will be used to analyze data from various tools and devices spanning the data center facility
- Control: Algorithms will be applied to enable machines to solve issues autonomously as well as intelligently automate repetitive tasks and perform predictive maintenance on both the IT and data center facility
- Data center operations: AI Ops will evolve to become a validation tool for continuous integration (CI) and continuous deployment (CD) for core IT functions that span the modern data center facility
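The monitoring step requires combining telemetry from disparate sources, such as a facility power meter and node-level IT sensors, that report on different schedules. Before any analytics can run, those streams must be placed on a common timeline. The sketch below shows one common way to do this: NumPy linear interpolation onto a shared grid, with grid points outside a stream's range marked as missing instead of extrapolated. The release does not describe the project's actual pipeline. The function `align_to_grid`, the sampling intervals and the readings are hypothetical.

```python
import numpy as np

def align_to_grid(timestamps, values, grid):
    """Linearly interpolate one telemetry stream onto a shared time grid.

    Hypothetical helper for illustration. Assumes timestamps are in seconds
    and sorted ascending. Grid points outside the stream's time range are
    returned as NaN instead of being extrapolated.
    """
    t = np.asarray(timestamps, dtype=float)
    v = np.asarray(values, dtype=float)
    g = np.asarray(grid, dtype=float)
    return np.interp(g, t, v, left=np.nan, right=np.nan)

# Made-up streams: a facility meter reporting every 60 s and IT power
# telemetry arriving at irregular times.
facility_t = np.arange(0, 601, 60)
facility_kw = np.array([1200, 1205, 1198, 1210, 1202, 1199, 1201, 1204, 1200, 1197, 1203])
it_t = np.array([0, 90, 200, 330, 450, 600])
it_kw = np.array([1000, 1000, 1010, 990, 1000, 1000])

grid = np.arange(0, 601, 60)   # shared 60 s timeline
features = np.column_stack([
    align_to_grid(facility_t, facility_kw, grid),
    align_to_grid(it_t, it_kw, grid),
])
print(features.shape)
```

Each row of `features` then describes a single moment across both sources, which is the form most analytics and machine learning tools expect. Linear interpolation is the simplest choice here. Streams with gaps or step-like behavior may call for a different resampling rule.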
HPE plans to demonstrate additional capabilities in the future by enhancing the HPE High Performance Cluster Management (HPCM) system so that it provides complete provisioning, management and monitoring, at a faster rate, for clusters scaling to 100,000 nodes. Other testing plans include exploring integration with HPE InfoSight, a cloud-based, AI-driven management tool that monitors, collects and analyzes data on IT infrastructure. HPE InfoSight is used to predict and prevent likely problems and so maintain overall server health and performance.