
Ferreira, KB and Levy, S (2023)

Using Benford's Law to Identify Unusual Failure Regions

Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pp. 516–519.

ISSN/ISBN: 9798400707858. DOI: 10.1145/3624062.3624121



Abstract: Fault tolerance remains a key challenge for current high performance computing systems. Effective and efficient scheduling of mitigation methods continues to be a critical issue in the face of dynamic and difficult-to-predict error rates found on many systems. Using failure data from the Astra supercomputer, we examine the efficacy of a simple method to determine if a sliding window of recent failures contains an unusual pattern of errors. Specifically, we investigate using Benford’s Law to predict the likelihood that the system is currently in a period of unusual failure occurrences. While still in its initial stages, this work provides critical analysis of failure status for extreme-scale systems and a simple form of prediction for determining when the scheduling of failure mitigation may be suboptimal and needs to be reevaluated due to the unusual pattern of errors that are occurring.
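The abstract does not say which quantity's digits are tested or which goodness-of-fit statistic is used, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the leading digits of inter-failure times are compared against Benford's Law, P(d) = log10(1 + 1/d), using a Pearson chi-square statistic over a sliding window. The function names, the window size of 50, and the threshold of 15.507 (the 5% critical value for 8 degrees of freedom) are all illustrative assumptions.

```python
import math
from collections import Counter

# Benford's Law: expected probability of each leading digit d in 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x: float) -> int:
    """Return the first significant digit of a positive number."""
    if x <= 0:
        raise ValueError("leading digit is only defined for positive values")
    # Scientific notation puts the first significant digit first.
    return int(f"{x:.15e}"[0])


def benford_chi_square(values) -> float:
    """Pearson chi-square distance between observed leading digits and Benford's Law."""
    digits = [leading_digit(v) for v in values if v > 0]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one positive value")
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD.items()
    )


def unusual_windows(inter_failure_times, window=50, threshold=15.507):
    """Flag each sliding window whose leading digits deviate from Benford's Law.

    Hypothetical helper: `window` and `threshold` are assumed values,
    not parameters taken from the paper.
    """
    flags = []
    for end in range(window, len(inter_failure_times) + 1):
        stat = benford_chi_square(inter_failure_times[end - window:end])
        flags.append(stat > threshold)
    return flags
```

A window in which every inter-failure time begins with the same digit yields a very large statistic and is flagged; how well such a flag tracks real failure behavior on a system like Astra is the question the paper itself examines.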


Bibtex:
@inproceedings{,
  author = {Ferreira, Kurt B. and Levy, Scott},
  title = {Using Benford's Law to Identify Unusual Failure Regions},
  year = {2023},
  isbn = {9798400707858},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3624062.3624121},
  doi = {10.1145/3624062.3624121},
  booktitle = {Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis},
  pages = {516–519},
}


Reference Type: Conference Paper

Subject Area(s): Computer Science