Reliability and Availability

Reliability and availability are critical metrics in system design, implementation, and operation. Predicting and identifying potential issues in these areas early on is essential for ensuring the safe and efficient operation of the entire accelerator complex. Their significance is even greater for safety-critical systems, such as Machine Protection Systems, where failures can have serious consequences. As a result, substantial efforts are made to address these aspects effectively.

Reliability refers to a system’s ability to function without failure. It is often expressed as the probability R(t=T) that the system operates successfully up to a given time T, which quantifies the likelihood of a fault occurring within that period, for example the system’s lifetime.
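
As an illustration of the definition above, the following minimal sketch evaluates R(t) under the common simplifying assumption of a constant failure rate, which gives R(t) = exp(-λt). The function name, the failure rate, and the mission time are hypothetical values chosen for the example and are not taken from any CERN system.

```python
import math


def reliability(t_hours: float, failure_rate_per_hour: float) -> float:
    """Probability of surviving up to t_hours without failure.

    Assumes a constant failure rate (exponential lifetime model),
    which is an assumption of this sketch, not a general property.
    """
    if t_hours < 0 or failure_rate_per_hour < 0:
        raise ValueError("time and failure rate must be non-negative")
    return math.exp(-failure_rate_per_hour * t_hours)


# Hypothetical numbers: one failure per 10,000 hours on average,
# evaluated over a 1,000 hour mission.
mission_time_h = 1000.0
failure_rate = 1e-4
print(f"R({mission_time_h:.0f} h) = {reliability(mission_time_h, failure_rate):.4f}")
```

Real components often have time-dependent failure rates (for example wear-out), in which case a different lifetime distribution would replace the exponential term.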

Availability, on the other hand, measures the proportion of time a system is operational or capable of functioning. Various factors, including system degradation and maintenance actions, can impact availability. It serves as a key performance indicator of CERN accelerators, offering insights into system robustness and potential areas for improvement.
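
The proportion described above can be sketched in a few lines. The first helper computes availability from observed uptime and downtime; the second uses the standard steady-state relation MTBF / (MTBF + MTTR). Function names and numbers are illustrative assumptions only.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the observed period during which the system was operational."""
    total = uptime_hours + downtime_hours
    if uptime_hours < 0 or downtime_hours < 0 or total <= 0:
        raise ValueError("durations must be non-negative and not both zero")
    return uptime_hours / total


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical example: 90 h of operation and 10 h of faults in a 100 h period.
print(f"Observed availability: {availability(90.0, 10.0):.2%}")
print(f"Steady-state availability: {steady_state_availability(99.0, 1.0):.2%}")
```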

Tools

Tracking, assuring and optimizing reliability and availability is a core task of the reliability team. The team uses a range of internally developed software as well as commercially available solutions.

AFT Statistics Website

Accelerator Fault Tracking is an organization-wide activity to track availability across the accelerator complex. This is mainly done by experts with the AFT tool (only available from within the CERN network) provided by BE-CSS. To make the collected data available to a wider audience and in a pre-processed format, the AFT Statistics Website provides up-to-date availability and failure rate statistics for most accelerators and their sub-systems across the complex.

AvailSim4

AvailSim4 is an open-source Monte Carlo simulation tool developed in-house for availability and reliability studies in the accelerator context. Its first release, in March 2021, provided a fully capable and easy-to-use environment for defining models, running extensive simulation campaigns, and obtaining results in a user-friendly format.

The software is under continuous development. Main features include phase-dependent failure and repair behaviour, modelling of complex repair strategies, a root cause analysis module, custom redundancy logic defined by injecting Python code, and increased computational efficiency through Quasi-Monte Carlo and Importance Splitting algorithms for rare-event simulations.
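
To give a feel for the Monte Carlo approach, here is a minimal, self-contained sketch that estimates the availability of a single repairable component. It does not use AvailSim4 or its API; the function name, the exponential failure and repair times, and all parameter values are assumptions made purely for illustration.

```python
import random


def simulate_availability(mtbf_h, mttr_h, horizon_h, n_runs, seed=0):
    """Estimate availability by simulating alternating up and down periods.

    Times to failure and times to repair are drawn from exponential
    distributions (an assumption of this sketch). Each run starts with
    the component operational.
    """
    rng = random.Random(seed)  # seeded for reproducible results
    total_uptime = 0.0
    for _ in range(n_runs):
        t = 0.0
        uptime = 0.0
        while t < horizon_h:
            time_to_failure = rng.expovariate(1.0 / mtbf_h)
            # Count only the uptime that falls inside the simulated horizon.
            uptime += min(time_to_failure, horizon_h - t)
            t += time_to_failure
            if t >= horizon_h:
                break
            # Component is down while being repaired.
            t += rng.expovariate(1.0 / mttr_h)
        total_uptime += uptime
    return total_uptime / (n_runs * horizon_h)


estimate = simulate_availability(mtbf_h=100.0, mttr_h=10.0,
                                 horizon_h=10_000.0, n_runs=200)
print(f"Simulated availability: {estimate:.3f}")
print(f"Analytical steady state: {100.0 / 110.0:.3f}")
```

For this simple case the analytical value MTBF / (MTBF + MTTR) is known, which makes it a convenient sanity check. Simulation becomes valuable once phases, repair strategies, or redundancy logic make closed-form solutions impractical.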

Compared with commercial tools, its key strengths are that it can be run as part of Python scripts, has tabular inputs and outputs, and can be massively parallelized on the CERN HTCondor cluster.

It is available in the AvailSim4 repository and can be installed via pip.

Isograph

Isograph is commercial software used for failure rate prediction (based on, e.g., the Military Handbook), FMECA, and fault tree analysis. This general-purpose software for reliability modelling is available at CERN with several user licenses and can be installed directly from CMF.

Component Failure Rate Prediction Pipeline & FMECA Assist Tools

For electronic component reliability predictions and FMECAs, an automated pipeline is available that processes EDA design files and produces FMECA templates. This automates the previously manual and time-consuming process of generating all electronic components of a circuit board in Isograph and reduces the chance of human error. The pipeline works with the 217+ prediction standard and integrates with Isograph, so that manual corrections can still be made. It is available online.

LHC Risk Matrix

A risk matrix is a common tool used in risk assessment, defining risk levels with respect to the severity and probability of occurrence of an undesired event. Risk levels can then be used to define subsystem reliability or personnel safety requirements. Over the history of the Large Hadron Collider (LHC), several risk matrices have been defined to guide system design. Initially, these focused on machine protection systems; more recently, they have also been used to prioritise consolidation activities. A new data-driven set of risk matrices for CERN’s accelerators is now available, based on data collected in the CERN Accelerator Fault Tracker (AFT). The data-driven approach improves the granularity of the assessment and limits uncertainty in the risk estimation, as it is based on operational experience. This is now the default tool to identify reliability requirements.
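
The basic mechanism of a risk matrix can be sketched as a lookup from a severity category and a probability category to a risk level. The categories, thresholds, and level names below are hypothetical and do not reproduce the LHC risk matrix or the data-driven matrices derived from AFT data.

```python
# Hypothetical ordered categories, from least to most severe / likely.
SEVERITY = ["minor", "moderate", "major", "catastrophic"]
PROBABILITY = ["rare", "unlikely", "possible", "frequent"]


def risk_level(severity: str, probability: str) -> str:
    """Map a (severity, probability) pair to an illustrative risk level.

    The score is the sum of the two category indices; the cut-offs
    between levels are arbitrary choices made for this example.
    """
    score = SEVERITY.index(severity) + PROBABILITY.index(probability)
    if score <= 1:
        return "acceptable"
    if score <= 3:
        return "tolerable"
    return "unacceptable"


# Print the full matrix, one row per severity category.
for sev in SEVERITY:
    row = [risk_level(sev, prob) for prob in PROBABILITY]
    print(f"{sev:>12}: {row}")
```

In practice each cell would be assigned explicitly from agreed requirements or operational data rather than from a simple additive score.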

Reliability Studies for Machine Protection Systems

A core activity is the reliability assurance of MPE systems. These studies integrate with the overall protection system life-cycle and are focused on identifying top-level risks and associated reliability requirements as well as bottom-up assessments to ensure that these requirements can be met. Outputs of these studies drive design improvements. Examples of ongoing and completed studies are mentioned below.

Protection Systems of the HL-LHC Inner Triplet, D1, and D2

A reliability analysis of the protection system of the HL-LHC Inner Triplet, D1, and D2 magnets has been performed using AvailSim4. Their protection is based on Quench Heaters (QH), the Coupling-Loss Induced Quench system (CLIQ), and Energy Extraction systems, in different configurations for the various magnets.

HL-LHC Energy Extraction

This project is a reliability study of the HL-LHC Energy Extraction (EE) systems. The upgraded version involves several new concepts and a technology that has not been used before at CERN (vacuum interrupters). The objective of this project is to quantify the likelihood of critical failures through modelling and simulation methods. Any period of EE system malfunction might expose the corresponding magnet to potentially severe consequences of a quench, leaving it vulnerable to irreversible damage. Such damage might cause long delays, falling in the 1-month to 1-year interval of the LHC risk matrix. High, near-perfect reliability whenever a quench occurs is therefore very important to ensure an appropriate level of protection and to meet the adopted targets. Simulations performed within this study investigate the failure rate estimates for individual components, the redundancy aspects of the new system, and the planned monitoring and maintenance strategies. The models are developed and evaluated in AvailSim4, a simulation software developed in-house specifically to address the needs of availability and reliability studies in accelerator-related systems.

Component level analysis has been completed in AvailSim4 using Monte Carlo simulations, providing interesting insights into the system reliability dynamics.
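
The effect of redundancy on the probability of successful operation on demand can be illustrated with a simple k-out-of-n calculation. This sketch assumes independent, identical units, and the numbers are hypothetical; it is not the model used in the study, which relies on Monte Carlo simulation in AvailSim4.

```python
from math import comb


def k_out_of_n_success(k: int, n: int, p_unit: float) -> float:
    """Probability that at least k of n independent units work on demand.

    p_unit is the per-unit probability of working on demand. Independence
    and identical units are assumptions of this sketch.
    """
    if not 0 <= k <= n:
        raise ValueError("require 0 <= k <= n")
    if not 0.0 <= p_unit <= 1.0:
        raise ValueError("p_unit must be a probability")
    return sum(
        comb(n, i) * p_unit**i * (1.0 - p_unit) ** (n - i)
        for i in range(k, n + 1)
    )


# Hypothetical unit that works on demand with probability 0.9.
p = 0.9
print(f"single unit:        {k_out_of_n_success(1, 1, p):.4f}")
print(f"1-out-of-2 (redundant): {k_out_of_n_success(1, 2, p):.4f}")
print(f"2-out-of-3 (voting):    {k_out_of_n_success(2, 3, p):.4f}")
```

Common-cause failures and imperfect monitoring break the independence assumption, which is one reason simulation-based studies are used for real protection systems.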

Figure: Energy extraction results

Availability Assessment of Future Accelerators

FCC

The Future Circular Electron-Positron Collider (FCC-ee) is CERN’s leading proposal for the next generation of energy-frontier particle accelerators. With a 91 km circumference, it is ambitious in both size and technical objectives, so much so that the sheer number of components that must be simultaneously operational is itself a risk to luminosity and physics goals. Availability and reliability are therefore key considerations, driving decisions even at this early stage of the design process. These studies model availability for this future machine, first by deconstructing contributions from each main constituent system, then by simulating their interconnection in an enhanced Monte Carlo environment. This has highlighted significant challenges relating to physics performance and operational cost that must be addressed in the FCC-ee’s technical design stage. Various game-changing R&D opportunities are further proposed that could change the landscape of this field.
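
Why the number of components alone is a risk can be seen from a simple series-system calculation: if every component must be operational at the same time and failures are independent, the overall availability is the product of the individual availabilities. The component count and per-component availability below are hypothetical and are not FCC-ee figures.

```python
import math


def series_availability(component_availabilities):
    """Availability of a system that needs all components up simultaneously.

    Assumes independent components, which is a simplification made
    for this sketch.
    """
    values = list(component_availabilities)
    if any(not 0.0 <= a <= 1.0 for a in values):
        raise ValueError("availabilities must lie between 0 and 1")
    return math.prod(values)


# Hypothetical: 10,000 components, each available 99.99% of the time.
n_components = 10_000
per_component = 0.9999
system = series_availability([per_component] * n_components)
print(f"System availability: {system:.3f}")
```

Even very reliable individual components lead to a low combined availability at this scale, which is why redundancy, fast repair, and fault tolerance become design drivers.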

Reliability and Availability Working Group (RAWG)

The RAWG is an ATS-wide advisory body in the fields of:
- Designing Dependable Systems
- Availability and Availability Optimisation of the Operational Accelerator Complex
- Reliability Analysis & Assessment for Future Accelerators
- Building Collaborations, Internally and Externally

Contact

Lukas Felsberger