MONARCH: Hierarchical Storage Management for Deep Learning Frameworks

2021 
For convenience and usability, many deep learning (DL) jobs resort to the available shared parallel file system (PFS) for storing and accessing training data when running in HPC environments. In such a scenario, however, where multiple I/O-intensive applications operate concurrently, the PFS can quickly become saturated with simultaneous storage requests and turn into a critical performance bottleneck, leading to throughput variability and performance loss. We present MONARCH, a framework-agnostic middleware for hierarchical storage management. This solution leverages the storage tiers already present at modern supercomputers (e.g., compute nodes' local storage, PFS) to improve DL training performance and alleviate the current I/O pressure on the shared PFS. We validate the applicability of our approach by developing an early prototype and integrating it with the TensorFlow DL framework. Results show that MONARCH can reduce the I/O operations submitted to the shared PFS by up to 45%, decreasing training time by 24% and 12% for I/O-intensive models, namely LeNet and AlexNet.
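To make the tiered-staging idea concrete, the sketch below illustrates it with TensorFlow's public tf.data API rather than MONARCH itself (whose implementation the abstract does not detail): the first epoch reads training data from the shared PFS, while a file-backed cache on node-local storage serves all subsequent epochs, cutting repeated I/O to the PFS. The paths, dataset layout, and record schema are hypothetical.

    # Minimal sketch of hierarchical staging for DL training data, assuming a
    # hypothetical PFS mount (/pfs) and a node-local scratch directory
    # (/local_scratch). Not MONARCH's actual implementation.
    import os
    import tensorflow as tf

    PFS_DATA_DIR = "/pfs/datasets/train"        # hypothetical shared PFS path
    LOCAL_CACHE_DIR = "/local_scratch/stage"    # hypothetical node-local tier

    os.makedirs(LOCAL_CACHE_DIR, exist_ok=True)

    def decode_record(serialized):
        # Parse one TFRecord example (schema assumed for illustration).
        features = tf.io.parse_single_example(
            serialized,
            {"image": tf.io.FixedLenFeature([], tf.string),
             "label": tf.io.FixedLenFeature([], tf.int64)})
        image = tf.io.decode_jpeg(features["image"], channels=3)
        return tf.image.resize(image, [224, 224]), features["label"]

    files = tf.data.Dataset.list_files(os.path.join(PFS_DATA_DIR, "*.tfrecord"))
    dataset = (
        tf.data.TFRecordDataset(files)            # epoch 1: read from the PFS
        .cache(os.path.join(LOCAL_CACHE_DIR,
                            "train.cache"))       # epochs 2+: served locally
        .map(decode_record, num_parallel_calls=tf.data.AUTOTUNE)
        .shuffle(buffer_size=1024)
        .batch(64)
        .prefetch(tf.data.AUTOTUNE)
    )

Placing cache() before map() stores the raw serialized records on the local tier, keeping the local footprint small while still ensuring each record is fetched from the PFS only once.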