Conference
Record
Recording location
Maison des Sciences de l'Homme - Dijon
Language:
French
Terms of use
Standard intellectual property law
Cite this resource:
JCAD. (2022, October 12). Playing with power at runtime: slightly slowed applications, major energy savings, in JCAD 2022. [Video]. Canal-U. https://www.canal-u.tv/135212. (Accessed May 17, 2024)

Playing with power at runtime: slightly slowed applications, major energy savings

Recorded: October 12, 2022 - Published online: November 29, 2022
Description

Power sobriety in data centers and high-performance computing (HPC) systems is becoming an important design issue, as the global energy consumption of information technology (IT) is rising to considerable levels. The question is all the more complex because these systems are increasingly heterogeneous and variable in both performance and power consumption. As applications struggle to make use of increasingly heterogeneous compute nodes, maintaining high efficiency (performance per watt) across the whole platform becomes a challenge. Additionally, applications tend to exhibit phases (I/O, compute- or memory-intensive, checkpointing) that vary over time, and they run in environments subject to external constraints (e.g., concurrency or an energy envelope).

This increasing complexity makes HPC systems less predictable offline (prior to execution). Dealing with time variations and unpredictable disturbances therefore demands runtime management. In this work, we perform dynamic adaptation using feedback control, applying control theory within the scope of autonomic computing. In particular, we address the control of the power allocated to processors, and hence their energy consumption and performance. Feedback control makes it possible to reduce energy consumption by lowering processor speed, with a limited and configurable performance loss, by exploiting periods where read/write operations slow down the application's progress. The proposed controller is easy to configure: the user only has to supply an acceptable degradation level. The system we control, an HPC application, undergoes many variations in its behavior, depending on (i) the cluster, (ii) the node, (iii) the run, and even (iv) the moment within a run.
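To make the idea of a feedback loop driven by a single "acceptable degradation" setting concrete, here is a minimal sketch in Python. It is not the controller presented in the talk: the saturating progress model, the constants (`PCAP_MIN`, `PCAP_MAX`, `MAX_RATE`), the gain, and the function names are all hypothetical, chosen only so the loop can run without any hardware. It uses plain integral action on a simulated plant.

```python
import math

PCAP_MIN = 40.0   # watts, hypothetical lower bound of the power cap
PCAP_MAX = 120.0  # watts, hypothetical upper bound of the power cap
MAX_RATE = 25.0   # hypothetical asymptotic progress rate (heartbeats/s)


def toy_progress(pcap: float) -> float:
    """Toy plant model (an assumption): progress saturates as the cap grows."""
    return MAX_RATE * (1.0 - math.exp(-(pcap - PCAP_MIN) / 20.0))


def run_control_loop(degradation: float, steps: int = 200, gain: float = 2.0):
    """Drive the power cap so progress tracks (1 - degradation) * max progress."""
    # The only user-facing knob is the acceptable degradation level.
    setpoint = (1.0 - degradation) * toy_progress(PCAP_MAX)
    pcap = PCAP_MAX  # start at full power
    for _ in range(steps):
        # Negative error means the application runs faster than required.
        error = setpoint - toy_progress(pcap)
        # Integral action in incremental form; clamping the cap to its
        # valid range also prevents the accumulated action from winding up.
        pcap = min(PCAP_MAX, max(PCAP_MIN, pcap + gain * error))
    return pcap, toy_progress(pcap), setpoint


pcap, progress, setpoint = run_control_loop(degradation=0.10)
```

With these made-up constants, `run_control_loop(0.10)` settles at a power cap well below the maximum while progress stays at the requested 90% of its peak; with a degradation of 0 the cap stays at `PCAP_MAX`. On a real node the toy model would be replaced by measured application progress, and the cap would be applied through the platform's power-capping mechanism.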

We evaluate our approach on top of an existing resource management framework, the Argo Node Resource Manager, deployed on several clusters of Grid'5000, using a standard memory-bound HPC benchmark. Our results show a family of trade-offs for saving energy, depending on the allowed degradation (from 0 to 20%). In particular, our control approach saves, on average, 22% of energy at the cost of a 7% increase in execution time, and reaches up to 25% energy savings with the adaptation. Our solution has proven robust to variations across machines (from one node to another) and across runs (from one execution of the application to another).
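As a quick sanity check on the headline trade-off, reading the 7% as an increase in execution time: since energy is mean power multiplied by execution time, the two percentages together imply how much the mean power must have dropped. The helper below is a hypothetical illustration of that arithmetic, not a result from the talk.

```python
def implied_power_ratio(energy_saving: float, slowdown: float) -> float:
    """Mean power ratio (controlled / baseline), derived from E = P_mean * T."""
    # E_ctrl / E_base = 1 - energy_saving and T_ctrl / T_base = 1 + slowdown,
    # so P_ctrl / P_base is the quotient of the two.
    return (1.0 - energy_saving) / (1.0 + slowdown)


ratio = implied_power_ratio(energy_saving=0.22, slowdown=0.07)
```

Here `ratio` is 0.78 / 1.07, roughly 0.73, meaning that under this reading the mean power drawn during the controlled run is about 27% below the baseline.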

The experiments conducted in this work require instrumenting low-level software stacks. Conducting them on Grid'5000 was key, as it allowed us to study various hardware setups (varying number of sockets, varying amount of memory) and their impact on the controller. The presence of clusters composed of homogeneous hardware allowed us to study the robustness of the devised control with respect to the variability in hardware performance despite identical specifications. Finally, our work relied on power measurements provided by the integrated sensors; it could be extended by also exploiting the power sensors available on the platform.

Our future work will tackle three remaining challenges: (i) handling various types of phases and their chaining in an application, (ii) distributed execution (a different powercap enforced on each processor or core), and (iii) non-instrumented applications (for which instrumentation is not possible).
