FDAE: A f̲ailure d̲etector for a̲synchronous e̲vents
- Authors: Farruggia, A.; Ortolani, M.; Lo Re, G.
- Publication year: 2010
- Type: Conference proceedings contribution published in a volume
- OA Link: http://hdl.handle.net/10447/53363
Abstract
Detecting element failures is a relevant issue in distributed systems. A fault-tolerant system needs to detect a failure and recover from it promptly. Traditional approaches to fault tolerance, however, are usually not completely free from errors during the failure detection phase; a good failure detector is thus a very important component for minimizing these errors. In this paper we present a failure detector able to monitor both asynchronous and synchronous elements of a distributed system by exchanging messages with the monitored elements. To assess the health status of monitored elements, our failure detector relies on a simple query/ACK mechanism, which in turn requires a reliable timeout estimate in order to properly set the monitoring interval. To this end, our failure detector uses the history of past estimates to compute new values for both the timeout and the monitoring interval. The model proposed here introduces a new label to tag monitored elements, in addition to those used in traditional failure detectors. To evaluate this work, we compared it with two other algorithms by computing performance metrics, such as specificity and sensitivity, and by considering the number of required control packets. We also compared the performance of the failure detectors by computing their detection time.
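The query/ACK idea with a history-based timeout can be illustrated with a minimal Python sketch. It is not the FDAE algorithm: the abstract does not give the estimator, so the rule used here (mean of recent ACK delays plus a multiple of their standard deviation) is a stand-in assumption, and all names (`TimeoutEstimator`, `classify`, `window`, `safety_factor`, `initial_timeout`) are hypothetical. The sketch also uses only the two conventional labels, since the additional label introduced by the paper is not specified in the abstract.

```python
from collections import deque
from statistics import mean, pstdev


class TimeoutEstimator:
    """Sliding window of observed query/ACK delays (hypothetical helper)."""

    def __init__(self, window=10, safety_factor=4.0, initial_timeout=1.0):
        self.samples = deque(maxlen=window)     # history of past delays
        self.safety_factor = safety_factor      # assumed margin multiplier
        self.initial_timeout = initial_timeout  # used until history exists

    def record(self, delay):
        self.samples.append(delay)

    def timeout(self):
        # Assumed rule: mean of the history plus a margin proportional
        # to its spread. This is NOT the estimator from the paper.
        if not self.samples:
            return self.initial_timeout
        return mean(self.samples) + self.safety_factor * pstdev(self.samples)


def classify(ack_delay, estimator):
    """Label one query round: 'trusted' or 'suspected'.

    ack_delay is the time between query and ACK, or None if no ACK came.
    """
    if ack_delay is None:
        return "suspected"
    on_time = ack_delay <= estimator.timeout()
    # Late ACKs are recorded too, so the timeout can adapt upward.
    estimator.record(ack_delay)
    return "trusted" if on_time else "suspected"
```

In this sketch a late ACK still updates the history, so an element that slows down is suspected at first and trusted again once the timeout has grown to match its new delay. A missing ACK leaves the history unchanged.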
