Author(s):
Domingues, Patrício Rodrigues
Date: 2008
Persistent ID: http://hdl.handle.net/10400.8/118
Origin: IC-online
Subject(s): Fault tolerance; Desktop grids; Volunteer computing; Checkpointing; Scheduling; Sabotage-tolerance
Description
Thesis presented to the Faculdade de Ciências e Tecnologia da Universidade de Coimbra for the degree of Doctor in Informatics Engineering, supervised by Luís Moura e Silva.
It is a well-known fact that most of the computing power spread over the Internet
simply goes unused, with CPU and other resources sitting idle most of the
time: on average less than 5% of the CPU time is effectively used. Desktop grids
are software infrastructures that aim to exploit the otherwise idle processing power,
making it available to users who require computational resources to run long-running
applications. The outcome of some efforts to harness idle machines can
be seen in public projects such as SETI@home and Folding@home, which boast impressive
performance figures, on the order of several hundred TFLOPS each. At
the same time, many institutions, both academic and corporate, run their own desktop
grid platforms. However, while desktop grids provide free computing power,
they need to face important issues like fault tolerance and security, two of the main
problems that hinder the widespread use of desktop grid computing.
In this thesis, we aim to exploit a set of fault tolerance techniques, such as
checkpointing and redundant executions, to promote faster turnaround times. We
start with an experimental study, where we analyze the availability of the computing
resources of an academic institution. We then focus on the benefits of sharing
checkpoints in both institutional and wide-scale environments. We also explore hybrid
schemes, where the traditional centralized desktop grid organization is complemented
with peer-to-peer resources.
Another major issue regarding desktop grids concerns the level of trust
that can be placed in the volunteered hosts that carry out the executions.
We propose and explore several mechanisms aimed at reducing the waste of
computational resources needed to detect incorrect computations. For this purpose,
we detail a checkpoint-based scheme for early detection of errors. We also propose
and analyze an invitation-based strategy coupled with a credit-rewarding scheme, to
allow the enrollment and filtering of more trustworthy and more motivated resource
donors.
To summarize, we propose and study several fault tolerance methodologies
oriented toward a more efficient usage of resources, resorting to techniques such
as checkpointing, replication and sabotage tolerance to speed up, and make more
reliable, executions that are carried out on desktop grid resources. Techniques
like these will be of utmost importance for the wider deployment of
applications over desktop grids.