HAL : hal-00715013, version 1

Geographical failover for the EGEE-WLCG Grid collaboration tools
Mathieu G., L'Orphelin C., Aidel O., Cavalli A., Pagano A. et al
Journal of Physics: Conference Series 119 (2008) 062022 - http://hal.archives-ouvertes.fr/hal-00715013
Computer Science/Databases
Computer Science/Hardware Architecture
Gilles Mathieu (1), Cyril L'Orphelin (1), Osman Aidel (1), Alessandro Cavalli (2), Alfredo Pagano (2), Rafal Lichwala (3)
1 : CC IN2P3 - Centre de Calcul de l'Institut National de Physique Nucléaire et de Physique des Particules
CNRS : USR6402 – IN2P3
12-14, boulevard Niels Bohr 69622 VILLEURBANNE CEDEX
2 :  INFN, Sezione di Bologna - Istituto Nazionale di Fisica Nucleare, Sezione di Bologna
Viale B. Pichat, 6/2 40127 Bologna
3 :  PSNC - Poznan Supercomputing and Networking Center
Worldwide grid projects such as EGEE and WLCG need highly available services, not only for grid usage but also for the associated operations. In particular, the tools used for daily activities and operational procedures are considered critical. In this context, the goal of the work on the EGEE failover problem is to propose, implement and document mechanisms and procedures that limit service outages for the operations and monitoring tools used by regional and global grid operators to control the status of the EGEE grid. The operations activity of EGEE relies on different tools developed by teams from different countries. Before this work, only one instance of each tool was deployed, so each tool was a single point of failure. We solved the problem by replicating the tools at different sites and using specific DNS features to switch automatically to another service instance when a failure occurs. After a DNS test phase in a virtual machine (VM) environment, focused on nsupdate, NS/zone configuration and short TTLs, a new domain for grid operations (gridops.org) was registered. In addition, replication of databases, web servers and web services has been investigated and configured. In this paper, we describe the technical mechanism used in our approach. We also show the replication procedure implemented for the EGEE/WLCG CIC Operations Portal use case. Furthermore, we discuss the relevance of failover procedures to other grid projects and grid services. Future plans for improving the procedures are also described.
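
The DNS-based switch described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record type (A), the host name `cic.gridops.org.`, the name server `ns1.example.org`, the addresses (taken from documentation ranges) and the helper names `pick_active` and `nsupdate_script` are all assumptions made for illustration. The sketch picks the first healthy replica and builds the text of an nsupdate command script that repoints the service name at it with a short TTL.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Instance:
    """One replica of a tool (hypothetical structure)."""
    site: str
    address: str  # IPv4 address of the replica


def pick_active(instances: Sequence[Instance],
                is_healthy: Callable[[Instance], bool]) -> Optional[Instance]:
    """Return the first healthy instance in priority order, or None."""
    for inst in instances:
        if is_healthy(inst):
            return inst
    return None


def nsupdate_script(server: str, zone: str, name: str,
                    address: str, ttl: int = 60) -> str:
    """Build nsupdate input that replaces the A record for `name`.

    A short TTL (60 s here, an assumed value) limits how long
    resolvers keep serving the address of the failed instance.
    """
    lines = [
        f"server {server}",
        f"zone {zone}",
        f"update delete {name} A",
        f"update add {name} {ttl} A {address}",
        "send",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    replicas = [
        Instance("primary", "192.0.2.10"),     # documentation address
        Instance("backup", "198.51.100.20"),   # documentation address
    ]
    # Stand-in health check: pretend the primary site is down.
    active = pick_active(replicas, lambda inst: inst.site != "primary")
    if active is not None:
        print(nsupdate_script("ns1.example.org", "gridops.org",
                              "cic.gridops.org.", active.address), end="")
```

In a real deployment the generated text would typically be fed to nsupdate with an authentication key (for example `nsupdate -k <keyfile>`), and the health check would probe the actual service rather than use a stand-in predicate.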

Journal of Physics: Conference Series
Publisher Institute of Physics: Open Access Journals
ISSN 1742-6588 (eISSN : 1742-6596)
Articles in peer-reviewed journals
Volume 119, article 062022

Failover – replication – redundancy – Grid

CORDIS number 34289
Acronym EGEE-II
Title Enabling Grids for E-sciencE-II