The Atacama Large Millimeter/Submillimeter Array (ALMA) Observatory, with its 66 individual radiotelescopes and other central equipment, generates a massive set of monitoring data everyday, collecting information on the performance of a variety of critical and complex electrical, electronic, and mechanical components. By using this crucial data, engineering teams have developed and implemented both model and machine learning-based fault detection methodologies that have greatly enhanced early detection or prediction of hardware malfunctions. This paper presents the results of the development of a fault detection and diagnosis framework and the impact it has had on corrective and predictive maintenance schemes.
The Atacama Large Millimeter/submillimeter Array (ALMA) observatory, with its 66 individual telescopes and other central equipment, generates a massive set of monitoring data every day, collecting information on the performance of a variety of critical and complex electrical, electronic and mechanical components. This data is crucial for most troubleshooting efforts performed by engineering teams. More than 5 years of accumulated data and expertise allow for a more systematic approach to fault detection and diagnosis. This paper presents model-based fault detection and diagnosis techniques to support corrective and predictive maintenance in a 24/7 minimum-downtime observatory.
The Atacama Large Millimeter/submillimeter Array is an interferometer comprising 66 individual high precision antennas located over 5000 meters altitude in the north of Chile. Several complex electronic subsystems need to be meticulously tested at different stages of an antenna commissioning, both independently and when integrated together. First subsystem integration takes place at the Operations Support Facilities (OSF), at an altitude of 3000 meters. Second integration occurs at the high altitude Array Operations Site (AOS), where also combined performance with Central Local Oscillator (CLO) and Correlator is assessed. In addition, there are several other events requiring complete or partial verification of instrument specifications compliance, such as parts replacements, calibration, relocation within AOS, preventive maintenance and troubleshooting due to poor performance in scientific observations. Restricted engineering time allocation and the constant pressure of minimizing downtime in a 24/7 astronomical observatory, impose the need to complete (and report) the aforementioned verifications in the least possible time. Array-wide disturbances, such as global power interruptions and following recovery, generate the added challenge of executing this checkout on multiple antenna elements at once. This paper presents the outcome of the automation of engineering verification setup, execution, notification and reporting in ALMA and how these efforts have resulted in a dramatic reduction of both time and operator training required. Signal Path Connectivity (SPC) checkout is introduced as a notable case of such automation.