Poster communications

Early failure detection: a method and some applications

Abstract : Short overview: Both Grid middleware services and applications face failures, and the more widely deployed they are, the higher the price of not detecting failures early (lost jobs, wasted resources ...). Automated detection, diagnosis, and ultimately management of software/hardware problems define autonomic dependability. This work reports on a generic mechanism for the autonomic detection of EGEE failures that involve abrupt changes in the behaviour of quantities of interest, and on some applications.

Analysis: The complexity of the hardware/software components, and the intricacy of their interactions, defeat attempts to build fault models from a priori knowledge alone. A black-box approach, where we observe the events to spot outliers, is appealing for its simplicity and for the large body of experience available from quality control. The general challenge is to detect anomalies as soon as possible. Much better solutions than simple thresholding are routinely used in, e.g., clinical trials and the supervision of production lines. In the case of abrupt changes, the Page-Hinkley statistic provides a provably efficient method, which minimizes the time to detection for a prescribed false alarm rate. We have applied this method to quantities (e.g. the number of arrived and served jobs per unit of time) that are easily computed from the output of existing services. The main result is that we are able to efficiently detect failures of very different origins (e.g. some software bugs, blackholes) without human tuning.

Impact: Fast and reliable detection of failures can both raise alarms that prompt operator intervention and trigger automatic reactions, e.g. avoiding job submission to blackhole sites. The proposed method is quite general, and can be applied at various points in the middleware, including the site level, or by end-user software. Nonetheless, the gLite Logging and Bookkeeping (LB) service, which concentrates information on job processing, would be the most effective target. The approach of influencing job scheduling with LB-computed statistics has been used before. Experimental validation and comparison are thus desirable, and for this a significant dataset of “challenge examples” should be available. Examples tagged by system administrators are rare. The Job Provenance (an archive of LB data and more) provides the required information in two respects: easy access to filtered LB data, and valuable information for calibrating and evaluating failure detection methods with respect to known and well-understood past events.

Conclusions: The implementation of the statistic per se is fairly straightforward. The code for exploiting the test on archived data, including both the extraction of the quantities of interest and the test itself, will be released through the Grid Observatory, in order to demonstrate the performance and scalability levels required for the production environment. Full integration into gLite raises the usual technical issues, and appropriate tools (triggering alarms etc.) remain to be developed.
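As an illustration of the Page-Hinkley test named in the abstract, the following is a minimal sketch of its textbook form, not the authors' released code. The class name `PageHinkley`, the helper `first_alarm`, and the values of `delta` and `threshold` are all assumptions made for this example; the abstract does not say how these parameters were chosen. The input series stands in for a quantity such as the number of served jobs per unit of time.

```python
from dataclasses import dataclass


@dataclass
class PageHinkley:
    """One-sided Page-Hinkley detector for an upward shift in the mean.

    delta and threshold are illustrative assumptions: delta is the
    magnitude of change that is tolerated, threshold sets the trade-off
    between false alarms and detection delay.
    """
    delta: float = 0.5
    threshold: float = 20.0
    n: int = 0            # number of observations seen so far
    mean: float = 0.0     # running mean of the observations
    cum: float = 0.0      # cumulative deviation from the running mean
    cum_min: float = 0.0  # smallest cumulative deviation seen so far

    def update(self, x: float) -> bool:
        """Feed one observation; return True if a change is signalled."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold


def first_alarm(series, delta=0.5, threshold=20.0):
    """Index of the first alarm for a shift in either direction, or None.

    A downward shift is detected by running a second detector on the
    negated series.
    """
    up = PageHinkley(delta, threshold)
    down = PageHinkley(delta, threshold)
    for t, x in enumerate(series):
        alarm_up = up.update(x)
        alarm_down = down.update(-x)
        if alarm_up or alarm_down:
            return t
    return None


if __name__ == "__main__":
    # Hypothetical data: a site serves 10 jobs per interval, then none.
    served_jobs = [10.0] * 50 + [0.0] * 50
    print("first alarm at interval", first_alarm(served_jobs))
```

In this sketch a larger `threshold` lowers the false alarm rate at the cost of a longer detection delay, which is the trade-off the abstract refers to.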
http://hal.in2p3.fr/in2p3-00453832
Contributor : Sabine Starita
Submitted on : Friday, February 5, 2010 - 4:25:30 PM
Last modification on : Wednesday, September 16, 2020 - 4:54:15 PM

Identifiers

  • HAL Id : in2p3-00453832, version 1

Citation

C. Germain-Renaud, A. Krenek. Early failure detection: a method and some applications. 3rd EGEE User Forum, Feb 2008, Clermont Ferrand, France. ⟨in2p3-00453832⟩
