Reliability, availability, and maintainability analysis of the Enhanced Resolution Imager and Spectrograph adaptive optics module and lesson learned before commissioning

Abstract. Enhanced Resolution Imager and Spectrograph is an instrument currently under commissioning at the Cassegrain focus of the Very Large Telescope UT4. Its mission is to replace the NAOS-CONICA and SINFONI instruments and to push the capabilities of this 8-meter class telescope to their limit by leveraging the adaptive optics module. The instrument has been designed for maximum lifetime and reliability and minimum downtime. We present the instrument constraints and our approach to the reliability, availability, and maintainability (RAM) analysis. We identified the main actors in the system; then, for each of them, we compiled a database of reliability parameters in order to build up the reliability diagram describing the failure sources. Starting from this information, we computed the system-wide reliability parameters and compared them with the customer's requirements. Such a scheme is very general and may be taken as an example of RAM analysis for astronomical instrumentation; it may also be customized for the needs of other projects. Finally, we summarize the lessons learned.


Introduction
Enhanced Resolution Imager and Spectrograph (ERIS) 1 is an instrument currently (2022) under commissioning at Unit Telescope 4 (UT4) of the European Southern Observatory (ESO) Very Large Telescope (VLT). ERIS is intended to upgrade the NAOS-CONICA + SINFONI pair and, to this end, it was conceived as a modular device, with an adaptive optics (AO) module that can feed two different scientific instruments: an imaging camera (NIX) 2 and a spectrograph (SPIFFIER). 3 The partners involved in the construction are: INAF for the AO module 4,5 and the calibration unit (CU); 6 Max Planck Institute for SPIFFIER; STFC, ETH, and NOVA for the NIX camera; and ESO itself for the wave front sensor (WFS) cameras and the real-time computing framework named SPARTA. 7 A picture of the entire unit is shown in Fig. 1, where the institute responsible for each instrument is indicated.
The AO module takes advantage of the adaptive (or deformable) secondary mirror 8 and of the four-laser guide star (LGS) module; 9 these systems are permanent facilities at the UT4.
As mentioned, the AO system has been designed and integrated at INAF-Osservatorio Astrofisico di Arcetri. The CU module was instead developed by INAF-Osservatorio Astronomico d'Abruzzo. The AO module is composed of two independent subsystems, namely a natural guide star (NGS) module and an LGS module. These modules embed an optical system to illuminate the detector, a lenslet array (the core of the Shack-Hartmann sensor), and optomechanics and motion control devices to steer and stabilize the pupil position on the WFS cameras. The product tree includes the lenses, mirrors, actuators/stages, electronic controls, sensors, and cables.
The instrument has a requested operational lifetime of 10 years, as indicated in Table 1, which also specifies the downtime and other relevant availability parameters.
In order to meet such availability requirements, we performed a reliability, availability, and maintainability (RAM) analysis.
In the following sections, we will first describe our approach to the RAM analysis, focusing on its scope and methods; then we will present the requirements, the associated flow down, the actors, and reliability tree; in the end, we will show the system reliability database and compute the quantity of spare parts. Since the process is very general and may be applied to other instruments, the reader may take it as an example of RAM analysis for astronomical instrumentation.

RAM Analysis Approach to Telescope Instrumentation
RAM analysis has been applied to telescope instrumentation only in the last few years. Therefore, we tried to fit the standard analysis process (e.g., from industry and space projects) to our system; we now describe our approach as a possible checklist for future instruments.

Strategy
The full process can be broken down into the following blocks.
Requirement analysis, flow down, and budgeting. The assessment of the RAM requirements and their flow down to subsystems and parts. A numerical value is attached to each element.
Identification of the main actors. The components and devices inside the unit, organized by type (e.g., electronics, connectors, and motors), in order to identify fault type and probability.
Creation of the reliability database. An inventory of parts, each with the associated information relevant to RAM, i.e., lifetime, mean time between failures (MTBF), and maintenance. Such information is collected from the vendor, the manufacturer, or previous experience.
Design and analysis of the reliability tree. Diagrams listing the potential faults of each device in the units and the associated impact on the entire system.
Design guidelines and feedback. We first proceed to identify the system weaknesses (in terms of RAM). Then we address each of them with a design update and with suggested modifications and operations, e.g., maintenance procedures, components selection, cables routing, and architectural considerations.
Estimation of the spare parts quantity. A list of the spare parts, each with associated quantity, which are needed to meet the lifetime requirement. The amount of spare parts is computed from the desired lifetime of the instrument and individual (expected or from datasheet) lifetime of the part itself.
In this paper, we will expand the aforementioned elements in the frame of the ERIS-AO module.

Identification of Main Actors
From the reliability point of view, ERIS can be sketched as the block diagram reported in Fig. 2. The CU provides the sources for the calibration of the entire system, which is mandatory for the commissioning and for the periodic maintenance, but it is not needed for on-sky operations. The mechanical structure is assumed to have a lifetime and a reliability much larger than those of electronics and moving parts, so it is not considered in the RAM computation. SPIFFIER and NIX are the scientific instruments and can be considered as mutually redundant (cold redundancy, i.e., the second is ready to be activated when the first fails); thus an efficient scheduler can overcome the failure of one of them while keeping the ERIS system up and running.
Therefore, the key point here is to qualify the event of a complete failure: a significant loss of observing time caused by the failure of either the AO module or both scientific instruments.

Requirement List, Flow Down, and Budgeting
The requirements driving the system design have been provided by the customer (ESO) and are listed in Table 1. A thorough evaluation of the customer requirements and a sound translation into quantifiable terms are fundamental for a helpful and realistic reliability assessment. We will analyze each high-level requirement and derive the constraints for the AO module.

Lifetime
The requested lifetime of ERIS is 10 years; therefore, each subsystem should guarantee a sufficient number of spare parts and the correct preventive maintenance to reach this target. As a starting point, an accurate selection of components with adequate lifetime is mandatory. In addition, to properly quantify component wear over time, we have to consider the effective duty cycle. The operation time requested by contract is 1200 h/year, to be compared with the telescope observing time, set to 3600 h/year. The customer gave no specific requirements in terms of duty cycle, but nothing prevents continuous use up to the requested time, each night. Moreover, it is likely that the system will not be removed from the Cassegrain focus during the rest of the year. Therefore, the working time of the electronics should be counted over all full nights.
The strategy chosen here acts in two directions.
• Selection of reliable, off-the-shelf parts. Custom parts are in general poorly characterized, which makes an exhaustive individual RAM analysis more complicated.
• Update of the design to grant easy access to parts for replacement in case of failure and to allow preventive maintenance.
For the AO module, we assumed the same lifetime as the instrument.

Availability
Availability represents the probability that ERIS is operative when requested and is expressed in terms of mean down time (MDT) and MTBF. The former is the average time during which the system is not available (i.e., the gap between the failure and the restart); the latter is the average time between two consecutive failures. As explained in Sec. 2.2, ERIS is considered unavailable when the AO module is unavailable or when both SPIFFIER and NIX are unavailable at the same time. The MDT shall not exceed 24 h/year. From the customer (ESO) point of view, a failure is a stop of operations longer than 5 h. Any instrument malfunction that can be repaired within 5 h (even during observation time) is not considered a failure. It follows that the design has to implement a modular approach that reduces the handling time for the replacement of broken parts. Another fundamental aspect is the concept of line replaceable unit (LRU), that is, a subsystem/module/component that can be replaced in situ by two technicians (as specified by the customer) in less than 4 working hours. As an example, an LRU is the module box including the 12, 24, and 48 V power supplies: the parts subject to failure are the individual power supplies, but the entire box is replaced in case of an individual failure, to speed up the maintenance procedure. Decomposing the system smartly into LRUs is a winning strategy in terms of reliability, because LRU replacements do not affect lifetime and downtime.
The MTBF requested for ERIS is 1 year: the budget has been split equally between CU, AO, and scientific instruments, connected serially 10 as described in Sec. 2.2. To compute it, we adopt the following equation:

$$\lambda = \sum_i \lambda_i, \qquad (1)$$

where $\lambda_i$ is the individual failure rate and $\lambda$ is the system failure rate, or the inverse of MTBF. Following the serial connection, the value considered for the whole instrument is

Design Guidelines and Feedback
We formulated some guidelines to improve the system design from a RAM perspective.
In particular, we focused on the following points.
• Motion devices. Parts with higher reliability shall be preferred, although suboptimal from a performance point of view.
• Cooling. Passive dissipation is to be preferred versus active, e.g., fans or coolant liquid.
• Not off-the-shelf components. Shall be limited as much as possible, due to the lack of RAM-related documentation, for instance the lack of extended tests (in terms of duration and statistics) to assess MTBF and failure mode.
• High module density. Many components stacked together in the same subrack or container shall be avoided: this point pairs well with the LRU strategy, so as to limit the number of items to be replaced in case of failure and, therefore, to reduce the downtime.
• Easy accessibility to modules and racks to speed up the maintenance.
• Extensive collection of telemetry data, for misbehaviors monitoring and failure avoidance.
• Component selection. Same component for similar tasks in different subsystems to reduce the RAM complexity, the maintenance procedures, and the quantity of spare parts.
We asked the design team to keep in close contact with the RAM team to check the need for redundancy as early as possible in the project timeline. In the end, the workflow may be summarized as follows: we analyzed the criticalities in the preliminary design, then we implemented a circular process of parts selection and feedback during the final design phase. To cite just a few examples of such feedback: we were forced to make the power supplies redundant and to drop custom parts; for some items [e.g., the atmospheric dispersion corrector (ADC) motor], the primary selection was replaced because of its high failure rate; and we selected the same rotator and tip-tilt mirrors for both LGS and NGS subsystems.
As a final comment, we considered such a circular feedback process extremely valuable to improve the reliability and availability of the AO module. We established a strong link between the RAM specialists and the instrument designers from the earliest phases of the project, achieving a minimum number of design releases and an increased efficiency of the process.

Reliability Database
Before starting the analysis of the failure sources and relations, we need to build up a logbook or database that characterizes each item in terms of failure rate and operating conditions. Such an operation may be very time-consuming and has to be scheduled well in advance.
We considered the following aspects:
• failure rate;
• working environment specification;
• storage environment specification;
• preventive maintenance;
• corrective maintenance; and
• spare parts availability.
The main issue encountered in this step is the estimation of the failure rates, necessary to feed the design process as mentioned before. Information could be gathered from the following sources.
Vendor datasheets. They are the first source of solid information (guaranteed by the producer); however, they are not always free and are often available on request.
International standards. They are a good reference for some basic components and for guidelines. On the other hand, they might not be regularly updated, and they may not cover the newest items or some complex technological products. Some references can be found in Refs. 11 and 12.
Similarity. Tests on a similar, well-known item are used to draw a comparison.
Laboratory test. It is possible to use failure rate estimates based on self-made tests.
For the AO module, the database included more than 100 parts. Since very often the parts installed are the same in similar projects (e.g., the same motor or the same humidity sensor), it is clear that a well-formulated database will be of great help for future projects.
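As an illustration of what one entry of such a database might look like, the following is a minimal sketch in Python. The class name, field names, and the example values are all hypothetical and are not taken from the ERIS database; the fields simply mirror the aspects listed above, plus the source of the failure rate estimate.

```python
from dataclasses import dataclass

# Hypothetical labels for the four information sources discussed in the text.
ALLOWED_SOURCES = {"datasheet", "standard", "similarity", "lab test"}


@dataclass(frozen=True)
class PartRecord:
    """One entry of a reliability database (illustrative layout only)."""

    name: str
    mtbf_hours: float            # failure rate stored as its inverse (MTBF)
    source: str                  # where the failure rate estimate comes from
    working_env: str             # working environment specification
    storage_env: str             # storage environment specification
    preventive_maintenance: str
    corrective_maintenance: str
    spare_available: bool

    def __post_init__(self) -> None:
        # Basic sanity checks so that bad entries are caught when they are created.
        if self.mtbf_hours <= 0:
            raise ValueError("MTBF must be positive")
        if self.source not in ALLOWED_SOURCES:
            raise ValueError(f"unknown source: {self.source}")

    @property
    def failure_rate_per_hour(self) -> float:
        """Failure rate, i.e., the inverse of the MTBF."""
        return 1.0 / self.mtbf_hours


# Placeholder entry with invented values, for illustration only.
example = PartRecord(
    name="hypothetical linear stage",
    mtbf_hours=50_000.0,
    source="datasheet",
    working_env="placeholder working range",
    storage_env="placeholder storage range",
    preventive_maintenance="placeholder procedure",
    corrective_maintenance="replace unit",
    spare_available=True,
)
```

Recording the source of each estimate next to the value makes it easier to judge how much a number can be trusted when the database is reused in a later project.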

Reliability Tree of the AO Module
At this point, it is time to identify all the components within the AO subsystems and analyze their role from a fault perspective. The control electronics and all the power supplies are placed outside the instrument, inside three racks.
The AO module is composed of two main moving parts: the natural guide star wave front sensor (NGS-WFS) subsystem in Fig. 3 and the laser guide star wave front sensor (LGS-WFS) subsystem in Fig. 4. Further details can be found in Ref. 4. In addition, the AO module includes a warm optics selector mirror used to feed either of the scientific instruments in turn. A convenient way to draw the reliability tree is to follow the light path within the instruments, as shown in Figs. 3 and 4 with the purple beam.
Figure 5(a) shows the block diagram for the AO subsystem: a module failure may occur if the selector mirror or the NGS bench fails. The LGS bench does not affect the lifetime computation because its purpose is to enhance the performance of the module and increase the sky coverage, but it is not strictly needed for functioning. In Fig. 5(b), we draw all the devices of the NGS-WFS bench: they are all serially connected. As above, two items can be excluded from the analysis: the ADC and the technical camera. The former is made of prisms rotated by motors, whose failure causes a slight loss of performance, whereas the latter is used to speed up the initial acquisition phase of the instrument.
The major (positive) impact of the LRU concept is on the control electronics: since they are located in separate cabinets with easy accessibility, the time needed to replace a failing electronic item is estimated to fall within the LRU definition. Therefore, it is not counted in the overall MTBF computation but only in the availability and spare parts estimation. We assigned an operational time to each device and component, based on the expected usage. We considered three values: 100% (full night and day), for the electronic components that are always powered on and for safety hardware; 14% (all nights, 4 months/year), as requested in Table 1, for all the devices that are actively operating during an observation night; and 2% for very low usage devices, conservatively assuming 1 h/night. Now that all the possible failures are traced, we can fill the full product tree in Tables 2 and 3 with the relevant parameters: failure rate (or MTBF) and operational time (OpsTime). The result in the last column of the two tables is computed as the ratio MTBF/OpsTime for each component, and these values are combined to compute the total (or subsystem total) value with Eq. (1).
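The duty-cycle scaling described above can be sketched as follows. This is a minimal illustration in Python, not the tool used for the analysis: the function names are ours, and the part names and datasheet MTBF values are placeholders invented for the example, not entries of Tables 2 and 3. Each datasheet MTBF is divided by the operational-time fraction, and the resulting values are combined as a series system, i.e., by summing the failure rates.

```python
HOURS_PER_YEAR = 8760.0


def effective_mtbf_years(mtbf_hours, ops_fraction):
    """MTBF/OpsTime, converted to calendar years.

    A part that is active only a fraction of the time accumulates
    operating hours more slowly, so its calendar-time MTBF is longer.
    """
    return mtbf_hours / ops_fraction / HOURS_PER_YEAR


def series_mtbf(mtbf_values):
    """Series combination: the system failure rate is the sum of the rates."""
    return 1.0 / sum(1.0 / m for m in mtbf_values)


# The 14% class corresponds to the contractual 1200 h/year over a full year.
observing_fraction = 1200.0 / HOURS_PER_YEAR

# Hypothetical parts: (name, datasheet MTBF in hours, operational fraction).
parts = [
    ("always-on controller", 200_000.0, 1.00),
    ("linear stage", 20_000.0, 0.14),
    ("low-usage shutter", 10_000.0, 0.02),
]

effective = {name: effective_mtbf_years(m, f) for name, m, f in parts}
subsystem_mtbf = series_mtbf(effective.values())

for name, years in effective.items():
    print(f"{name}: {years:.1f} years")
print(f"subsystem (series): {subsystem_mtbf:.1f} years")
```

In a series system the combined MTBF is always shorter than that of the weakest part, which is why a few low-MTBF moving parts dominate the subsystem totals.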
For the NGS-WFS board (see Table 2), we obtained an MTBF of 2.2 years, in series with the MTBF of the selector mirror (75 years, not actually affecting the final value); for the LGS-WFS board (see Table 4), we obtained an MTBF of 1.8 years.
As a comparison, the MTBF of the CU module is 10 years, while that of NIX + SPIFFIER is 11 years. Such a large difference with respect to the MTBF of both NGS and LGS is expected, since the WFS boards include several moving parts (stages and rotators).
At the end of our analysis, the total MTBF of the instrument is 1.5 years.
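As a quick cross-check, the quoted values can be combined with Eq. (1). The sketch below is in Python and rests on two assumptions drawn from the text: the AO module is the series of the NGS-WFS board and the selector mirror (the LGS-WFS board is left out, since it is not strictly needed for functioning), and the AO module, the CU, and the NIX + SPIFFIER block are connected serially. The function name is ours.

```python
def series_mtbf(mtbf_years):
    """Eq. (1): the system failure rate is the sum of the individual rates."""
    return 1.0 / sum(1.0 / m for m in mtbf_years)


# Values quoted in the text, in years.
NGS_WFS = 2.2
SELECTOR_MIRROR = 75.0
CU = 10.0
NIX_SPIFFIER = 11.0

# AO module: NGS-WFS board in series with the selector mirror.
ao_module = series_mtbf([NGS_WFS, SELECTOR_MIRROR])

# Whole instrument: AO module, CU, and scientific instruments in series.
instrument = series_mtbf([ao_module, CU, NIX_SPIFFIER])

print(f"AO module MTBF:  {ao_module:.2f} years")
print(f"Instrument MTBF: {instrument:.2f} years")
```

Under these assumptions the result is close to 1.5 years, in agreement with the total quoted above, and the selector mirror changes the AO value only marginally.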

Spare Quantity Estimation
The amount of spare parts was estimated following the LRU concept together with the lifetime specification. The value can be computed by considering the MTBF of each element and comparing it with the 10 years of requested lifetime. We can use a standard memoryless exponential distribution model to get the reliability function. 10 The probability density function in such a case is

$$f(t) = \lambda e^{-\lambda t}, \qquad (2)$$

where $t$ is the time and $\lambda$ is the failure rate. The reliability function is defined as

$$R(t) = 1 - \int_0^t f(\tau)\,\mathrm{d}\tau = e^{-\lambda t}. \qquad (3)$$

A reasonable reliability value for our purposes is 90% at 10 years: this means that within this period, with 90% probability, we do not expect failures due to parts wear. According to Eq. (3), such a value corresponds to an MTBF of about 100 years. Then we listed all the components (or their critical parts) ordered by MTBF and selected those under this threshold; we decided to provide at least one spare for each of those parts. Since all spare parts can reasonably be repaired (or ordered) in a time negligible with respect to the MTBF of each component, we considered that a single spare part would be enough for this purpose.
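The threshold can be reproduced by inverting the exponential reliability function of Eq. (3). The following Python sketch is only an illustration: the function names are ours, and the part list is hypothetical, with invented MTBF values.

```python
import math


def reliability(mtbf_years, t_years):
    """Exponential model: R(t) = exp(-t / MTBF), with MTBF = 1 / lambda."""
    return math.exp(-t_years / mtbf_years)


def required_mtbf(target_reliability, lifetime_years):
    """MTBF for which R(lifetime) equals the target reliability."""
    return -lifetime_years / math.log(target_reliability)


# 90% reliability over the 10-year lifetime.
threshold = required_mtbf(0.90, 10.0)
print(f"MTBF threshold: {threshold:.1f} years")

# Hypothetical parts with invented MTBF values (years), ordered by MTBF.
parts = {"rotator": 12.0, "tip-tilt mirror": 40.0, "lens mount": 500.0}
needs_spare = [
    name for name, mtbf in sorted(parts.items(), key=lambda kv: kv[1])
    if mtbf < threshold
]
print("parts needing at least one spare:", needs_spare)
```

The exact value is slightly below 100 years (about 95 years), consistent with the rounded threshold quoted above.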

Conclusions and Lesson Learned
The ERIS system was integrated in 2020 to 2021 and installed at VLT-UT4 in December 2021; it is currently (2022) under on-sky commissioning before being released for science operations.
For the AO module, we implemented a RAM analysis, which is a toolbox widely adopted to ensure that a system is affordable, stable, and fully available over a specified lifetime. Starting from a list of reliability parameters of the system components, we explored the failure modes and their probability versus time, and we used the findings to improve the design, to plan preventive maintenance procedures, and to evaluate the quantity of spare parts.
The analysis presented in this paper suggests a general scheme and may be taken as a template for other instruments. At the end of the integration and testing phase, we came up with a few lessons learned.
Early failure. A very serious issue was the early malfunctioning, and even breakage, of some devices: for instance, we had several issues with the motorized iris of the LGS-WFS during the assembly, integration, and verification (AIV) process. The lesson here is to run individual, extensive tests on the procured devices as part of the incoming inspection procedure.
Cables and connectors. Far too little attention was paid to cables and connectors. ERIS required a large number of high pin-density connectors, and such cables are not easily handled by most manufacturers. Furthermore, such high-density connectors required stiff cables (selected for their robustness) which, in turn, required very critical cable-to-panel insertion procedures, often resulting in damage to the cable or to the panel connector.
This aspect emerged late in the project life cycle, during the AIV phase, and the only possible corrective action was to rebuild the full set of cables. The lesson learned is to add a cable/connector activity to the risk management plan. Such an activity requires early prototyping for risk mitigation, with the advantage of exploring newer and more suitable solutions.
Note: Bold marks the result values for assembled parts (those indicated in italics in the table), i.e., the "cumulative result" of their components.
Custom hardware. The recommendation is to limit the usage of custom hardware as much as possible, trying to avoid their intrinsic low availability and development uncertainty. Another important aspect is the time required for delivering or fixing such items, which resulted in delay in the project timetable. A lesson learned is to foresee in the project the use of off-the-shelf parts as temporary replacements for the final devices.
Low crowding and clear mapping. Low device crowding is fundamental for the unavoidable maintenance and repair operations. Another fundamental point was to have a full mapping of cables and connectors, clearly reported in easy-to-use manuals.
As a brief summary, we learned that the major issues came from the coupling of parts (e.g., cables plus connectors) more than from the parts themselves, and we found that it is fundamental to run laboratory tests to validate the subsystems and to identify failures early.
He was involved in the design, integration, and commissioning of the AO module for ERIS. Since 2012, he has also been a part of the INAF team committed to the optical calibration of the adaptive M4 mirror of the ELT.
Nicolò Azzaroli is a temporary staff member (post-doc) at INAF. He has been committed to the design, integration, and test of the AO module for ERIS. He is also involved in other AO projects such as M4, MORFEO, MAVIS, and ANDES.
Chiara Selmi is a temporary staff member at INAF. She has been committed to the design, integration, and test of the AO module for ERIS. She is also involved in other AO projects such as M4, MORFEO, MAVIS, and ANDES.