Reliability and Availability
Reliability and availability are significant metrics for systems design, implementation, and operation. Prediction and early identification of potential issues with either of the two aspects remain cornerstones of safe and efficient operation of the entire accelerator complex. Their importance increases even further when the systems in question are safety-critical, such as Machine Protection Systems. Therefore, a significant effort is undertaken to address the two aspects appropriately.
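Reliability concerns the faultless operation of a system. It can be defined as a success function R(t), evaluated at a time of interest t = T. In other words, R(T) is the probability that the system operates without a fault up to time T, and 1 - R(T) is the probability of having a fault within that period, for example the system lifetime.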
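As an illustration of the definition, the sketch below evaluates R(T) under the assumption of a constant failure rate (the exponential model). This assumption is not stated in the text and is only one common choice; the function names and the numbers in the usage note are hypothetical.

```python
import math


def reliability(t: float, failure_rate: float) -> float:
    """Return R(t) = exp(-failure_rate * t).

    Assumes a constant failure rate (exponential lifetime model).
    `t` and `failure_rate` must use consistent units, e.g. hours and 1/hour.
    """
    if t < 0 or failure_rate < 0:
        raise ValueError("t and failure_rate must be non-negative")
    return math.exp(-failure_rate * t)


def fault_probability(t: float, failure_rate: float) -> float:
    """Return 1 - R(t): the probability of at least one fault before time t."""
    return 1.0 - reliability(t, failure_rate)
```

For instance, `reliability(1000.0, 1e-3)` evaluates exp(-1), so a component with a mean time to failure of 1000 hours survives its first 1000 hours with a probability of roughly 37% under this model.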
Availability describes the effective fraction of time when the system is operating (or capable of doing so). A range of reasons can lead to situations where the system is degraded or requires time for maintenance. The ratio of uptime to the total time is a key performance indicator, providing information on the robustness of the system and potential ways to improve it.
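The ratio described above can be written down directly. The sketch below computes availability from recorded uptime and downtime, and also from mean time between failures (MTBF) and mean time to repair (MTTR), a standard steady-state approximation. The function names are illustrative and do not come from any tool mentioned on this page.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the total time during which the system was up."""
    total = uptime_hours + downtime_hours
    if uptime_hours < 0 or downtime_hours < 0 or total <= 0:
        raise ValueError("durations must be non-negative and not both zero")
    return uptime_hours / total


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run availability MTBF / (MTBF + MTTR) of a repairable system."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)
```

With 90 hours of uptime and 10 hours of downtime, `availability(90.0, 10.0)` gives 0.9.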
Methods for quantitative analysis of both indicators are based on modelling. The team takes advantage of software developed in-house, as well as commercially available solutions.
Commercial software is used for prediction of failure rates (based on the Military Handbook), FMECA, and fault tree analysis. This general-purpose software gives a useful first approximation, but AvailSim4 is then used to take into account particular features.
AvailSim4 is a Monte Carlo simulation tool developed in-house for availability and reliability studies in the accelerator context. Its first release took place in March 2021, providing a fully capable and easy-to-use environment for defining models, running extensive simulation campaigns, and obtaining results in a user-friendly format.
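To give a flavour of what a Monte Carlo availability study involves, here is a minimal, generic sketch for a single repairable component. It is not the in-house tool's interface or algorithm: the exponential failure and repair times, the function names, and all parameters are assumptions made for illustration only.

```python
import random


def simulate_uptime_fraction(mtbf: float, mttr: float, horizon: float,
                             rng: random.Random) -> float:
    """Simulate one history of a repairable component and return its uptime fraction.

    Assumes exponentially distributed times to failure (mean `mtbf`) and
    repair times (mean `mttr`); the component starts in the working state.
    """
    t = 0.0
    up = 0.0
    while t < horizon:
        time_to_failure = rng.expovariate(1.0 / mtbf)
        up += min(time_to_failure, horizon - t)  # count uptime only inside the horizon
        t += time_to_failure
        if t >= horizon:
            break
        t += rng.expovariate(1.0 / mttr)  # component is down while being repaired
    return up / horizon


def estimate_availability(mtbf: float, mttr: float, horizon: float,
                          n_runs: int, seed: int = 0) -> float:
    """Average the uptime fraction over `n_runs` independent simulated histories."""
    rng = random.Random(seed)
    total = sum(simulate_uptime_fraction(mtbf, mttr, horizon, rng)
                for _ in range(n_runs))
    return total / n_runs
```

For long horizons the estimate should approach MTBF / (MTBF + MTTR), which gives a simple sanity check. Realistic models add many interacting components, operational phases, and repair logic, which is where a dedicated tool becomes necessary.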
The software is under continuous development. The main features of the second release (December 2021) include: simulation of LHC phases, modelling of complex repair strategies, a root cause analysis module, custom-defined redundancy logic (MYRRHA use case with MPE-MI), and a new simulation method, the Importance Splitting algorithm, with a 100-fold speed-up for rare-event simulations.
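The general idea behind splitting methods for rare events can be shown on a toy problem. The sketch below is not the in-house implementation; it is a hypothetical example that estimates the small probability that a downward-drifting random walk, started at 1, reaches a high level before returning to 0. The rare event is split into a chain of more likely stages, and the stage probabilities are multiplied. The known closed-form (gambler's ruin) result is included for comparison.

```python
import random


def stage_probability(start: int, p_up: float, n_trials: int,
                      rng: random.Random) -> float:
    """Estimate P(walk started at `start` reaches start + 1 before hitting 0)."""
    hits = 0
    for _ in range(n_trials):
        x = start
        while 0 < x < start + 1:
            x += 1 if rng.random() < p_up else -1
        if x == start + 1:
            hits += 1
    return hits / n_trials


def splitting_estimate(level: int, p_up: float, n_trials: int, seed: int = 0) -> float:
    """Estimate P(reach `level` before 0, starting from 1) as a product of stages.

    Each intermediate level k = 1 .. level - 1 is one stage. For this walk the
    state on reaching a level is simply the level itself, so every stage can be
    restarted from it directly.
    """
    rng = random.Random(seed)
    estimate = 1.0
    for k in range(1, level):
        estimate *= stage_probability(k, p_up, n_trials, rng)
    return estimate


def exact_probability(level: int, p_up: float) -> float:
    """Closed-form gambler's-ruin probability, valid for p_up != 0.5."""
    r = (1.0 - p_up) / p_up
    return (1.0 - r) / (1.0 - r ** level)
```

With `level=10` and `p_up=0.3` the exact probability is of the order of 3e-4. A naive simulation would see the event only a handful of times per 10,000 runs, whereas every stage of the split estimate has a probability of at least 0.3 and is therefore easy to sample.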
LHC Risk Matrix
A risk matrix is a common tool used in risk assessment, defining risk levels with respect to the severity and probability of the occurrence of an undesired event. Risk levels can then be used for different purposes, e.g. defining subsystem reliability or personnel safety requirements. Over the history of the Large Hadron Collider (LHC), several risk matrices have been defined to guide system design. Initially, these were focused on machine protection systems; more recently, they have also been used to prioritise consolidation activities. A new data-driven development of risk matrices for CERN’s accelerators is presented in this paper, based on data collected in the CERN Accelerator Fault Tracker (AFT). The data-driven approach improves the granularity of the assessment and limits uncertainty in the risk estimation, as it is based on operational experience. In this paper, the authors introduce the mathematical framework, based on operational failure data, and present the resulting risk matrix for the LHC.
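A rough sketch of what "data-driven" can mean in this context: recorded fault downtimes are sorted into severity bins, and the observed frequency of each bin is computed. The bin edges, labels, and sample data below are hypothetical placeholders, not the categories or the mathematical framework used in the paper.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical severity bins by downtime, in hours (edges are lower-inclusive).
SEVERITY_EDGES_H = [1.0, 24.0, 168.0, 720.0, 8760.0]
SEVERITY_LABELS = [
    "<1 h", "1 h-1 day", "1 day-1 week",
    "1 week-1 month", "1 month-1 year", ">1 year",
]


def severity_bin(downtime_h: float) -> str:
    """Return the severity label for a single fault, based on its downtime."""
    return SEVERITY_LABELS[bisect_right(SEVERITY_EDGES_H, downtime_h)]


def frequency_per_year(downtimes_h: list[float], years_observed: float) -> dict[str, float]:
    """Observed number of faults per year in each severity bin."""
    counts = Counter(severity_bin(d) for d in downtimes_h)
    return {label: counts.get(label, 0) / years_observed for label in SEVERITY_LABELS}
```

Applied to a fault log, `frequency_per_year` yields one observed frequency per severity row, which can then be compared against the frequency considered acceptable for that row.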
HL-LHC Inner Triplet, D1, D2, and corrector magnets
The reliability analysis of the protection systems of the HL-LHC Inner Triplet, D1, and D2 magnets has been performed using AvailSim4. Their protection is based on Quench Heaters (QH), the Coupling-Loss Induced Quench system (CLIQ), and Energy Extraction systems, in different configurations for the different magnets.
A comprehensive reliability study of the BIS2.
HL-LHC Energy Extraction
This project is a reliability study of the HL-LHC Energy Extraction (EE) systems. The upgraded version involves several new concepts and a technology that has not been used at CERN before (vacuum interrupters). The objective of this project is to quantify the likelihood of critical failures through modelling and simulation. Any period of EE system malfunction might expose the corresponding magnet to the potentially severe consequences of a quench, leaving it vulnerable to irreversible damage. Such damage might cause long delays, in the 1-month to 1-year range of the LHC risk matrix. High, near-perfect reliability whenever a quench occurs is therefore essential to ensure an appropriate level of protection and to meet the adopted targets. The simulations performed within this study investigate failure rate estimates for individual components, the redundancy of the new system, and the planned monitoring and maintenance strategies. The models are developed and evaluated in AvailSim4, simulation software developed in-house specifically to address the needs of availability and reliability studies of accelerator-related systems.
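The effect of redundancy on the probability of failing on demand can be illustrated with a standard k-out-of-n calculation. This is a generic textbook sketch, not the model used in the study: it assumes identical, independent channels, and the function name and numbers are hypothetical.

```python
from math import comb


def k_out_of_n_failure_probability(n: int, k: int, p_fail: float) -> float:
    """Probability that fewer than `k` of `n` channels work on demand.

    Assumes identical, statistically independent channels, each failing on
    demand with probability `p_fail`. Common-cause failures are ignored.
    """
    if not (1 <= k <= n) or not (0.0 <= p_fail <= 1.0):
        raise ValueError("require 1 <= k <= n and 0 <= p_fail <= 1")
    p_ok = 1.0 - p_fail
    # Sum over the outcomes in which only i < k channels are working.
    return sum(comb(n, i) * p_ok ** i * p_fail ** (n - i) for i in range(k))
```

For example, a single channel with a 1% failure probability on demand fails in 1 out of 100 demands, while two such channels in a 1-out-of-2 arrangement fail together in about 1 out of 10,000 demands, provided the independence assumption holds.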
Component-level analysis has been completed in AvailSim4 using Monte Carlo simulations, providing insights into the reliability dynamics of the system.