Identification and adaptive control of a high-contrast focal plane wavefront correction system

All coronagraphic instruments for exoplanet high-contrast imaging need wavefront correction systems to reject optical aberrations and create sufficiently dark holes. Since the most efficient wavefront correction algorithms (controllers and estimators) are usually model-based, the modeling accuracy of the system influences the ultimate wavefront correction performance. Currently, wavefront correction systems are typically approximated as linear systems using Fourier optics. However, the Fourier optics model is usually biased due to inaccuracies in the layout measurements, the imperfect diagnoses of inherent optical aberrations, and a lack of knowledge of the deformable mirrors (actuator gains and influence functions). Moreover, the telescope optical system varies over time because of instrument instabilities and environmental effects. In this paper, we present an expectation-maximization (E-M) approach for identifying and real-time adapting the linear telescope model from data. By iterating between the E-step (a Kalman filter and a Rauch smoother) and the M-step (analytical or gradient-based optimization), the algorithm is able to recover the system even if the model depends on the electric fields, which are unmeasurable hidden variables. Simulations and experiments in Princeton's High Contrast Imaging Lab demonstrate that this algorithm improves the model accuracy and increases the efficiency and speed of the wavefront correction.


Introduction
In the upcoming era of 30-m ground-based telescopes and advanced space telescopes, direct imaging is believed to be the next frontier in exoplanet detection and characterization. Unlike indirect detection methods, such as radial velocity and transit, direct imaging collects light from the planet itself rather than its host star, thus enabling the spectral characterization of the planet's atmosphere and the full determination of its orbital parameters. But exoplanets are much fainter than their parent stars, requiring the starlight's point spread function (PSF) to be managed to make high contrast imaging of the exoplanet possible.
A leading technology for achieving the high contrast needed for exoplanet imaging is a coronagraph. [1][2][3][4] Consisting of a series of optimally designed masks and stops, coronagraphs are able to suppress the spread of starlight and thus create high-contrast detection regions, so-called dark holes, in the image plane. However, since the coronagraphs are designed assuming perfect optics, they are fundamentally sensitive to any wavefront perturbations. Even small aberrations introduce bright stellar speckles in the dark holes, which influence the instrument ability for exoplanet obervations. To maintain a high contrast for exoplanet observations, wavefront correction is required for all coronagraph instruments. In a ground-based telescope, the wavefront correction system typically includes a wavefront sensor, such as a spatially filtered Shack-Hartmann sensor or a pyramid sensor, 5,6 in the pupil plane to measure the wavefront aberrations and then directly compensates for them using deformable mirrors (DMs). Such a system is able to cancel pupil phase aberrations due to atmospheric turbulence and achieve contrasts of 10 −5 to 10 −6 on current telescopes, allowing for the imaging of young hot gas giant planets. 7 Directly imaging dimmer planets (down to Earth size) at higher contrast requires a space telescope that reaches contrasts below 10 −8 before postprocessing. 8,9 For these coronagraph instruments targeting earth-sized planets, the pupil plane approach with a separate wavefront sensor is not capable of reaching the required high contrast values because of noncommon-path errors. Instead, we need to estimate the focal-plane electric field using only camera images and compute the DM control signals for based on the estimated field. This estimation and control problem is commonly referred to as focal-plane wavefront correction (FPWC). Effective FPWC algorithms require efficient estimation algorithms and accurate models of the optical system, particularly of the influence of DM voltage commands on the focal-plane electric field.
In all current FPWC systems, the optical models needed for control and estimation have been derived by applying Fourier optics to the optical layout. However, using only the Fourier optics approach results in significant bias errors due to inaccuracies in measurements of the optical system, imperfect knowledge of the systematic optical aberrations, and poor or biased models of the DM influence. This has a detrimental effect on the wavefront correction speed and the final achievable contrast. Classical approaches for eliminating these model errors and improving the system performance include pupil plane phase retrieval 10,11 and laser interferometric DM surface characterization 12 in advance, which are usually time consuming and also introduce noncommon-path errors. In this paper, we propose a datadriven framework using the expectation-maximization (E-M) algorithm to accurately identify and adaptively control the FPWC system. In contrast to classical approaches, our method does not require a change to the optical system design, tracks real-time systematic changes, and speeds up convergence of the controller.
In the following sections, we first provide a brief overview of the FPWC system, including mathematical modeling and current state-of-the-art on wavefront estimation, control, and model calibration. In addition, we also propose a idea to formulate the problem as a stochastic optimization problem. Then, we review the E-M algorithm and derive the E-M equations of the FPWC system. We finally report the simulation and experimental results on the FPWC system identification and adaptive control in the Princeton's High Contrast Imaging Lab (HCIL) to demonstrate the method's ability.

Overview of High-Contrast Focal Plane Wavefront Correction
In this section, we review the current state-of-the-art in FPWC and we also introduce a idea to formulate FPWC as a stochastic optimization problem. In Secs. 2.1-2.3, we review the approaches of optical system modeling, wavefront estimation and control, and model calibration, which are related to our new algorithm. Readers already familiar with these subjects may skip these sections.

Mathematical Modeling
The FPWC system with a coronagraph is typically formulated as a state-space model for the convenience of control applications. In this state-space formulation, the control inputs, observations, and state variables are, respectively, the DM voltage commands, camera images, and the focal plane electric fields. We begin this section by deriving this underlying state-space model. The block diagram in Fig. 1 shows the architecture of the telescope optics and the control loop. We define E ab as the aberrated pupil electric field before the DMs and Δϕ m as the phase change introduced by the DMs at correction iteration m. The coronagraph operator Cf·g represents the propagation from the DM to the focal plane camera. After k correction iterations, the focal plane electric field is given by E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 1 ; 3 2 6 ; 7 5 2 Δϕ m : (1) When the phase change due to the DM is small (typically, the DM surface perturbation is smaller than 30 nm), the focal plane electric field in Eq. (1) can be expanded in a Taylor series about Δϕ m to yield E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 2 ; 3 2 6 ; 6 6 7 (2) where we used the fact that the coronagraph is a linear operator in the applied electric field (as it is composed of Fresnel propagations, Fourier transforms, and coronagraph mask multiplications).
The DM phase change, Δϕ k , at each step is approximated by summing weighted influence functions produced by each actuator on the DM. That is, given an influence function, f q , which represents the q'th actuator's response to a unit voltage input, the DM induced phase change across the pupil is approximated by the linear superposition: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 3 ; 3 2 6 ; 4 9 0 where N act is the number of DM actuators and u k;q is the voltage command change of the q'th actuator. We use u k;q instead of Δu k;q in this paper for notational simplicity. Substituting Eq. (3) into Eq. (2) results in a linear relationship between the focal plane electric field and the DM voltage commands: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 4 ; 3 2 6 ; 3 7 2 By discretizing and vectorizing the 2-D electric fields, Eq. (4) can be put in the common matrix form of the state transition model: where diagf·g represents the diagonal matrix constructed from a vector. Equations (5) and (8) make up the linear state-space model of the FPWC system.
The elementwise product structure in Eq. (8) decouples the linear state transition equations in each pixel, so the electric field of a single pixel can be estimated based only on its own intensity measurements. For mathematical convenience, we can further split the real and imaginary part of the electric field and derive real-valued state-space equations: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 0 9 ; 6 3 ; 2 0 5 x k;j ¼ x k−1;j þ G j u k ; z k;j ¼ H k;j x k;j ; where j ∈ f1; · · · ; N pix g is the index of camera pixels, and E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 0 ; 3 2 6 ; 7 5 2 The elements of the state vectors and matrices are now real numbers, which is more convenient for developing estimators and controllers.
Good DM "probe" commands should help construct wellconditioned measurement matrices, H k;j , for all the pixels in the dark holes. Commands that create "Sinc" waves on the DM surface are usually good choices, because camera pixels in two symmetric rectangular areas are influenced by this type of probe according to Fourier analysis. 15

Focal Plane Wavefront Estimation and Control
With the state-space model developed in Sec. 2.1, we now introduce the wavefront estimation and control algorithms. The baseline wavefront estimation approach used in most implementations to date is the least-square, batch process estimator (BPE), 13,14 which can be derived as follows. We begin with the linear observation model in Eq. (9) but with an additive noise term, n k;j , to represent camera measurement noise and observation matrix errors (originally from Jacobian matrix errors): E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 1 ; 3 2 6 ; 4 2 8 z k;j ¼ H k;j x k;j þ n k;j : (11) Theoretically, the camera measurements should follow Possion distributions. However, in our case, there are a sufficient number of starlight photons for detection, making it safe to assume the measurements follow centered Gaussian distributions on the top of the speckles, i.e., n k;j ∼ N ð0; R k;j Þ. We can thus perform a least-square regression to estimate the expectation,x k;j , and the covariance matrix, P k;j , of the state vector at each time step, k: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 2 ; 3 2 6 ; 3 0 8x k;j ¼ ðH T k;j H k;j Þ −1 H T k;j z k;j ; P k;j ¼ ðH T k;j H k;j Þ −1 H T k;j R k;j H k;j ðH T k;j H k;j Þ −1 : Repeating this regression procedure for all pixels provides an estimate of the entire electric field in the focal plane. Although this algorithm is easy to implement, one weakness is that its estimation accuracy, indicated by the estimation covariance, is fully determined by the measurement noises. When the signal-tonoise ratio is low (which happens as the dark hole improves), this BPE may not provide accurate enough estimates to be used for control. 16 A better solution is to incorporate prior knowledge from the model and the previous measurements using a Kalman filter. 17 This formulation allows us to introduce an additive process noise term, w k , to the state-space equations: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 3 ; 3 2 6 ; 1 2 5 where w k;j ≅ ΔG j u k þ r k;j , of which the first term comes from the Jacobian matrix errors and the second term comes from system instabilities, such as DM drift. Assuming elements of the Jacobian matrix bias ΔG j and the instability term r k;j follow zero-mean Gaussian distributions (the model bias ΔG j is the difference between the true Jacobian matrix and the current Jacobian matrix in use. Each element of Jacobian errors, ΔG j , has zero mean, since we have no knowledge whether the current Jacobian influence is larger or smaller than the true value. This is intuitively true in real experiment because the biases of different actuators may have influence of different directions, so they will cancel with each other), the process noise also becomes an additive Gaussian noise, w k;j ∼ N ð0; Q k;j Þ, which satisfies the requirement of Kalman filtering. At control iteration k, the state transition model provides a prediction of the current state, since we havex k−1;j and P k−1;j from the previous estimation. With this prior knowledge, we can derive the log-likelihood function of the current state and the observations: log pðz k;j ; x k;j Þ ¼ − where E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 5 ; 6 3 ; 4 8 2x k;jjk−1 ¼x k−1;j þ G j u k ; P k;jjk−1 ¼ H k;j P k;j H T k;j þ R k;j ; (15) are the a-priori state and covariance estimates and pðz; xÞ is the joint probability density function for z and x. The Kalman filter maximizes this log-likelihood function, so it optimally combines the information from the model and the observations and thus reduces the estimation covariance.
Recently, Riggs et al. 16 introduced an incoherent light term into the nonlinear observation model, Eq. (6): E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 6 ; 6 3 ; 3 6 They then employed an extended Kalman filter (EKF) to simultaneously estimate both the coherent electric field and incoherent intensity (which contains the planet). This method removes the requirement for pairwise probing and makes simultaneous wavefront correction and planet detection possible. This EKF approach is not used in this paper; however, it will be applied to improve the system identification work described here in a future paper to characterize system nonlinearities.
With the state estimates available, it is now possible to compute the DM voltage commands to manipulate the focal plane electric field. For mathematical simplicity, we construct a real, linear state transition model by combining the state equations for each pixel [Eq. (13)] into a single vectorized form: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 7 ; 6 3 ; 1 8 8 where E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 8 ; 6 3 ; 1 4 6 x k ¼ The controller must drive the state x k as close to zero as possible to maintain a high contrast in the dark hole. Currently, the two most popular model-based optimal controllers are electric field conjugation (EFC) 18 and stroke minimization (SM). 19 EFC works by minimizing a cost function consisting of the total energy in the dark holes and a Tikhonov regularization, which can be written as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 1 9 ; 3 2 6 ; 6 9 7 min u k x T k x k þ α k u T k u k ; s:t: where α k is the Tikhonov regularization parameter. By contrast, SM aims to find the smallest DM commands that achieve a target contrast, which can be formulated as the constrained minimization: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 0 ; 3 2 6 ; 6 1 6 min u k u T k u k ; s:t: where C k is the target contrast, or total energy, in the dark holes. The equality constraints can be incorporated into the cost function via a Lagrange multiplier, making SM into a similar formula to EFC: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 1 ; 3 2 6 ; 5 3 5 The optimal solutions of Eqs. (19) and (21) give two corresponding feedback control laws: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 2 ; 3 2 6 ; 4 7 5 where I ∈ R N act ×N act is the identity matrix. It is evident that EFC and SM, as more rigorously discussed by Groff et al., 15 in fact, define the same control law except for the tuning parameters, α k , and the Lagrange multiplier, μ k . The Lagrange multiplier, μ k , is a function of the target contrast C k , which is the tuning parameter in SM. Both α k and μ k introduce a damping term, although based on different considerations, in the matrix inversion, which helps avoid an ill-posed matrix inversion problem and, more importantly, reduce the influence of Jacobian matrix biases. Tuning the damping parameter, α k and μ k , which turns out to be nontrivial, is the key to properly implementing the controllers. 20

Model Calibration
Because the wavefront estimators and controllers are all modelbased, their performance highly depends on the accuracy of the underlying model. It is common to precalibrate the model based on some testbed measurments before running high-contrast FPWC. To date, all the model calibration approaches work to improve the Jacobian matrix in the linear state-space formulation.
As indicated by Eq. (5), the Jacobian matrix is fundamentally a function of the aberrated pupil electric field, E ab , and the actuator influence functions, f 1∶N act , so an indirect approach to improving the model is to characterize E ab and f 1∶N act separately and then compute the Jacobian matrix based on the coronagraphic propagation equation in Eq. (4). The influence functions are usually characterized using laser interferometry. 12 Since it is too time-consuming to measure the surface responses of all the actuators (several thousands on each DM), typically, only a few representative actuators are characterized with the assumption that all actuators have similar responses. The pupil electric field, though, cannot be directly measured. It is typically reconstructed from multiple focused and defocused images using phase retrieval algorithms. [21][22][23][24] However, all the phase retrieval algorithms assume a certain light propagation model, so they do not have the ability to diagnose any errors from an incorrect optical layout prescription. In addition, since the coronagraph typically blocks most of the light from the entrance pupil, there are very few photons to provide the needed information.
To fix this, current phase retrieval approaches require removal of the coronagraph to collect data; this makes the phase retrieval time consuming and prone to noncommon path error.
Recent work by Zhou et al. 25 started exploring system identification methods for determining the Jacobian matrix in favor of directly identifying the Jacobian matrix by perturbing the DM shapes and observing the resulting camera images. The physical interpretation of a Jacobian matrix column is the influence of a DM actuator with unit voltage command on the focal plane electric field. Therefore, by definition, the Jacobian matrix can be derived by commanding each actuator and estimating the focal plane electric field changes. The least-squared, BPE was employed in that work for the electric field estimation. However, since BPE requires a large amount of data and is relatively noisy, the identification procedure was time consuming and the resulting identified model was too noisy to be used in the wavefront correction. In addition, the identified model using BPE was also limited by the initial knowledge of the Jacobian matrix. Therefore, up to now, this work has only been used for qualitatively understanding the sources of the model errors, instead of quantitatively correcting the Jacobian matrix errors.

New Theoretical Results: FPWC as a Stochastic Optimization Problem
As can be seen in Secs. 2.2 and 2.3, the typical approach to focal-plane wavefront control is to examine the wavefront estimation, wavefront control, and model calibration as separate problems. In this section, we try to bridge these aspects by formulating the FPWC problem as a single stochastic optimization problem. As first shown by Sun et al., 26 this approach provides better physical insights into the tuning parameters in the algorithms and also provides theoretical ayalyses on how the wavefront control, estimation, and model accuracy influence the final contrast in the dark hole. The ultimate goal of the FPWC is to minimize the total intensity, x T k x k , in the dark holes. Since the state, x k , is a random variable, we can formulate FPWC as a stochastic optimization/ control problem that minimizes the expectation of the dark hole intensity, hx T k x k i. The state variable follows the stochastic process in Eq. (17). With the assumption that the process noise, w k , has a zero mean, the expectation at step k can be distributed as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 3 ; 6 3 ; 1 9 5 where the statistics of the previous state are provided by the past wavefront estimation, hx k−1 i ¼x k−1 and varðx k−1;j Þ ¼ P k−1;j . (The covariance matrix P k−1;j is an indicator for the estimation accuracy, which is also a function ofx k−1 .) The process noise, as explained in Sec. 2.2, includes the Jacobian matrix errors and the system instabilities, w k ≅ ΔGu k þ r k . Given r k ∼ N ð0; S k Þ, the process noise covariance in Eq. (23) The stochastic optimization problem can now be written as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 4 ; 3 2 6 ; 6 7 5 min S k is eliminated in the optimization because it is a constant covariance matrix. As this cost function indicates, the final contrast depends not only on the DM commands, u k , but also on the estimation accuracy, TrðP k−1;j Þ, and the Jacobian uncertainties, W. Minimizing the first four terms of the cost function over u k defines the wavefront controller, while minimizing the trace of the estimation covariance matrix, TrðP k−1;j Þ, overx k−1 defines the wavefront estimator. In addition, system identification or classical model calibration can be used to reduce the model uncertainties, TrðWÞ ¼ kΔGk 2 F , which also improves the final achievable contrast from the wavefront correction.
By definition, the entries of the regularization matrix are as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 5 ; 3 2 6 ; 4 5 9 where the subscripts, m and l, represent the column indices of the Jacobian bias matrices. Each column of ΔG gives the modeling errors of an actuator's influence, so W m;l indicates the covariance of Jacobian errors from the m'th and l'th actuators. In general, W is a symmetric positive definite matrix with nearly all the entries nonzeros. The off-diagonal entries in W disappear if the modeling errors of different actuators are assumed to be unrelated from each other. EFC or SM with scalar regularization further assume that the covariance of errors from different actuators is identical. Thus, given that E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 6 ; 3 2 6 ; 2 7 2 where W degrades to a scaled identify matrix: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 7 ; 3 2 6 ; 1 9 3 This shows that tuning the Tikhonov regularization parameter or Lagrange multiplier is equivalent to finding the magnitude of Jacobian uncertainties in our model. A smaller regularization parameter indicates smaller Jacobian errors, which finally leads to higher contrast according to Eq. (24). In the following sections, we will present the system identification and the adaptive control using the scalar regularization assumption in Eq. (26). This assumption is not fundamentally necessary for our E-M algorithm, but it will significantly simplify the algorithm implementation. Characterizing the filled regularization matrix (by assuming each actuator's error not independent) turns out to be very hard, because the high-dimensional system suggests that it is usually underdetermined and requires tremendous amount of data for identification. To proceed with this idea, we have to assume some known structure of the regularization matrix (For example, we can assume only neighboring actuators are coupled, which makes the matrix very sparse.) or incorporate dimension reduction techniques, such as principal component analysis or sigular-value decomposition, to reduce the number of adaptable parameters. We will leave these explorations for future work.

Expectation-Maximization (E-M) Algorithm
The stochastic optimization formulation in Sec. 2.4 indicates the potential from system identification for improving the wavefront corrections. Moreover, it indicates not only that identifying the Jacobian matrix is necessary, but also that characterizing the process and observation noises is important for tuning the optimal estimators and controllers. In this section, we develop an E-M algorithm-based approach 27 to accomplish all of these goals.

Review of the E-M Algorithm
The E-M algorithm is an iterative system identification algorithm to find the maximum a posteriori (MAP) estimates of the model parameters in the presence of hidden variables. 28,29 Hidden variables are the states of a dynamical system, which are not directly observable. Since the true values of the hidden variables are absent, we cannot explicitly derive the log-likelihood function and apply the maximum likelihood estimation (MLE) to identify the model parameters as usual. Instead, we maximize a lower bound of the log-likelihood of only the model inputs and outputs. In general, with the hidden variables, the model inputs and outputs (the commands and observations of a system, usually referred to as the training data), and the model parameters (coefficients parametrizing the model function), denoted as X, Y, and θ respectively, the log-likelihood can be written as an integral of the marginal probability over the hidden variables: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 8 ; 6 3 ; 3 2 5 Lfθg ¼ log pðYjθÞ ¼ log Assuming the hidden variables follow a probability distribution, QðXÞ, a lower bound on the log-likelihood, F ðQ; θÞ, can be found using Jensen's inequality: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 2 9 ; 6 3 ; 2 4 9 The E-M algorithm alternates between maximizing this lower bound with respect to the hidden variable distribution, QðXÞ, and the model parameters, θ. Optimizing over the distribution QðXÞ while fixing θ is called the expectation-step (E-step), and optimizing over the model parameters θ while fixing QðXÞ is called the maximization-step (M-step).
In the M-step, F ðQ; θÞ is maximized when ∫ X QðXÞ log pðX; YjθÞdX ¼ E X ½log pðX; YjθÞ, the expectation of the log likelihood is maximized. This is a stochastic MLE problem of the model parameters, θ.
Theoretically, the model parameter estimation always converges to a local minimum after enough iterations of the E-step and the M-step. The number of iterations it takes depends on the initial knowledge of the model parameters given to the algorithm. In FPWC, the model computed based on the Fourier optics can be used as the initial guess into the algorithm. Since it is pretty close to the true value, the parameter estimation converges within only one or two E-M iterations.

E-M Algorithm for FPWC System
The state-space model of the FPWC system defines a typical input-output hidden Markov process, where the focal plane electric fields are the hidden variables and the Jacobian matrix as well as the process and measurement noise covariance matrices are the model parameters, so the E-M algorithm is suitable for the this system. Moreover, since the dynamics of different pixels are decoupled in the FPWC system under the linear assumption, we can separately and in parallel identify the model parameters of each pixel separately, which saves a lot of computation time.
Here, we copy the state transition and observation equations defined by Eqs. (11) and (13): E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 3 1 ; 3 2 6 ; 3 1 7 x k;j ¼ x k−1;j þ G j u k þ w k;j ; w k;j ∼ Nð0; Q k;j Þ; z k;j ¼ H k;j x k;j þ n k;j ; H k;j ¼ 4u pT k G T j ; n k;j ∼ Nð0; R k;j Þ; where k ∈ f1; · · · ; N d g is the index of the control iterations, j ∈ f1; · · · ; N pix g is the index of camera pixels, and N d is the total number of the control iterations. Based on Eq. (31), the model parameters, hidden variables, and model inputs and observations for the single-pixel E-M algorithms can be, respectively, denoted as θ j ¼ fG j ; Q ;1∶N d j ; R 1∶N d ;j g, X j ¼ fx 0∶N d ;j g, and Y j ¼ fu 1∶N d ; u p 1∶N d ; z 1∶N d ;j g. By assuming the process noise w k;j ≅ ΔG j u k þ r k and the Jacobian errors from different actuators are independent, as shown in Eq. (26), the process noise covariance matrix is as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 3 2 ; 3 2 6 ; 1 3 9 where S k;j , the covariance from the system instability term r k , is assumed to be a constant scalar matrix, δ 2 I 2×2 , over iterations. In our following simulation and experiment, since the instability term is much smaller compared with the Jacobian bias, we neglect δ 2 (assume δ 2 ¼ 0) and only identify σ 2 to determine the process noise covariance. Without changing exposure time, the observation noise covariance matrix is also a constant scalar matrix over iterations: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 3 3 ; 6 3 ; 7 0 8 R k;j ¼ R j ¼ ν 2 I n×n ; (33) where ν is the standard deviation of the observation noise and n is the number of pairs of probes. As a result, the model parameters are simplified as θ j ¼ fG j ; σ 2 ; ν 2 g in the current E-M algorithm for the FPWC system. The E-M equations for X j , Y j , and θ j can be derived following the approach of Ghahramani and Hinton 30 As shown in that paper, for a linear Gaussian dynamical system like FPWC, the E-step can be achieved by Kalman filtering and Rauch smoothing and the M-step can be achieved by finding the analytical solution of a quadratic optimization problem. However, since the Jacobian matrix and observation matrix in FPWC have shared parameters and our control variables are high-dimensional, the model parameter update equations and the optimization method are a little different from the standard approach. The implementation details of the E-M algorithm for FPWC system are explained in the next section. For notational simplicity, we will omit the subscript j in the following derivations and discussions, understanding that the E-step and the M-step are repeated N pix times.

E-step
In what follows, we introduce notationsx k 1 jk 2 and P k 1 jk 2 , which represent the estimated expectation and covariance of the hidden states at control iteration k 1 given observations up to and including at control iteration k 2 . With these simplified notations, the conditional probability in the E-step becomes E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 3 9 ; 6 3 ; 1 2 1 P kjk ¼ ðI − K k H k ÞP kjk−1 ; wherex kjk−1 and P kjk−1 are the a priori knowledge of the states and covariance matrix from observations up to control iteration k − 1, andx kjk and P kjk are the a posteriori estimates updated by the observations at step k. Rauch smoother then propagates the states backward from the last step to the starting step and further updates the estimates based on the data of the future steps. Mathematically, Rauch smoother is a Kalman filter using the next hidden state as the observation. The Rauch smoothing equations are as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 4 0 ; 3 2 6 ; 6 7 5 E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 4 1 ; 3 2 6 ; 6 4 2x E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 4 2 ; 3 2 6 ; 6 1 6 wherex kjN d and P kjN d are the estimated hidden state's expectation and covariance based on all the N d steps of data,

M-step
The M-step defines a stochastic MLE problem. Based on the Markovian structure of Eq. (31), the log likelihood of the hidden states, model inputs, and observations is as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 4 3 ; 3 2 6 ; 4 9 2 LðG;Q;RÞ ¼ log where E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 4 4 ; 3 2 6 ; 2 9 3 The expectation of this log-likelihood can be calculated using the state estimates in the E-step. Therefore, we can estimate the model parameters by taking the derivatives of the log-likelihood with respect to each parameter and forcing the resulting expectations to be zero: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 4 7 ; 6 3 ; 6 6 3 Q ¼ E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 4 8 ; 6 3 ; 5 8 8 Equation (46) is an implicit equation, so G needs to be found recursively. From our earlier assumption, Q and R are forced to be scaled identity matrices: where we accordingly obtain as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 5 0 ; 6 3 ; 4 3 9 σ 2 ¼ The process covariance, σ 2 , can be used in the EFC algorithm for computing the Tikhonov regularization parameter, as shown in Eq. (27).
One shortcoming of this analytical solution is the large matrix inversion in Eq. (46). To ensure the matrix is invertible, we have to collect several thousand steps (greater than the number of actuators on DMs) of data before making an update, which is unnecessarily time-consuming and also precludes online system adapting. In order to update the model with a smaller amount of data, we can use a stochastic gradient ascent algorithm instead for updating the Jacobian matrix: where the tuning parameter η defines the learning rate of the algorithm. However, this method may not be able to reach exact optimal solutions. These two subsections presented all of the E-M equations for FPWC system. By repeating the iterative E-M approach on all the pixels we can reconstruct the linear state-space model for the entire system. While that is sufficient, it is helpful to apply a final step, forcing the process and observation noise matrices of all the pixels to be equal to their average. Since all pixels in the dark hole share almost the same noise distributions, neglecting the small difference in photon noises, this step enhances the robustness of the E-M algorithm.
The remainder of the paper will present two ways to apply the E-M algorithm to the FPWC system, offline system identification, and online adaptive control. In Sec. 4, we identify the system using precollected data and try to understand the sources of aberrations in our system. In Sec. 5, we integrate the E-M algorithm into the control loop and adapt the model parameters and control policy in real time. Simulation and experimental results are reported in both cases.

E-M Algorithm-Based System Identification
In this section, we numerically and experimentally investigate the E-M algorithm-based system identification for FPWC. Our goal for the system identification is to precisely characterize the Jacobian errors. In addition, we will also take this chance to understand important algorithmic details, for example, the influence of the hyperparameters (batch size and amount of data) on the algorithm's performance or how hard it is to characterize different types of model errors.
The experiment is conducted in the Princeton's HCIL and the simulation uses the same setup. As shown in Fig. 2, the HCIL testbed is a two-DM FPWC system with shaped pupil (SP) coronagraph. It utilizes a ripple pupil plane mask to suppress the contrast by changing the starlight PSF. In addition, a bowtie shaped focal plane mask blocks the center part of the PSF to avoid camera saturation, which also defines the dark hole regions for the FPWC. Each DM in the HCIL has 952 actuators. Without loss of generality, we only activate the first DM in simulation and experiment. The second DM is treated as a fold mirror.

Data generation
In the numerical study, we simulated the DM commands and resulting camera images under an imperfect lab condition. Wavefront aberrations with 10-nm RMS were added to the shaped pupil plane and two DM planes. The DM actuators' gains were biased by 20% to account for the influence function errors. (The influence function shape errors were neglected in our simulation, however, the E-M algorithm is able to handle this type of errors as proved in the experimental results.) Shot noises and readout noises were added to the simulated camera images, where the noises' standard deviations were chosen based on measurements in the HCIL. In the numerical model, the masks are modeled as 0-1 binary matrices, the propagations through the OAPs or lenses to their focuses are modeled as Fourier transform, and all the free-space propagations between devices are modeled as Fresnel propagations.
To sufficiently explore the controllable space of the DM, we generated the data by applying random DM commands in the system identification approach. In our simulation, in total 4000 random voltage commands (between −0.6 − 0.6 volts) were applied to the DM and the resulting camera images were simulated. A fixed exposure time of 0.1 s was used for the camera images. For each random DM commands, we collected two pairs of probing images, so we have in total 16,000 images (2 images∕pair × 2 pairs∕command × 4000 commands) in our data set. The random commands between −0.6 and 0.6 V typically result in contrast changes at a level of 1 × 10 −6 . In order to make the DM influence significant enough for learning so that the effect is larger than the background speckles, in our simulation, we first ran wavefront control for four steps to reach a contrast of roughly 3 × 10 −6 and then applied the random DM commands and generated the images. Same "probe" comands were used for all 4000 data points. Although identical pairwise probes are not necessary for the E-M system identification, as as will be discussed, it helps us build a metric to evaluate the effectiveness of the identification.

Evaluation metrics of the identification accuracy
Three metrics were used to evaluate the model errors in our analysis. The first is the percentage error of the E-M identified Jacobian, G EM , compared with the true Jacobian including optical aberrations and influence function biases, G: (52) The second metric assumes we are blind to the true Jacobian matrix (which is true in the experiment); we thus reserve part of the data as a validation set. Theoretically, the difference between two neighboring observations with the same probing commands is a function of only the DM commands: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 5 3 ; 6 3 ; 3 3 So, we can define a percentage validation error of the identified Jacobian matrix, via E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 5 4 ; 6 3 ; 2 6 9 Validation Error ¼ where N v is the number of data steps in the validation set. The scale of validation error could be a little different from Jacobian error since it actually measures the difference between G T EM G EM and G T G instead of G EM and G, however, they should have similar trends and are both good indicators of model accuracy.
The third metric that indirectly reflects the accuracy of a Jacobian matrix is the correction speed and the final achievable contrast of the wavefront control using it. With a more accurate Jacobian matrix, the wavefront control should achieve a higher contrast with fewer control iterations.

System identification results
In this section, we applied the E-M algorithm-based system identification in various ways to the simulation data to test the algorithm. The analytical method in Eqs. (46)-(48) and the gradient ascent method in Eq. (51) were, repectively, tried to solve the stochastic MLE problem. For the gradient ascent method, we also examined the effect of using different batch sizes. The batch size is a machine learning term referring to the number of data points utilized in one E-M update. Theoretically, small batch sizes enable timely model parameter updates and timeefficient parallel computing, but sacrifice the accuracy of each update because the hidden states estimation with small batch sizes has relatively larger covariance. The algorithm was also investigated with different numbers of data points. Our goal for this section is mainly to validate the reasonability of the evaluation metrics defined in the previous section, and to compare the performance of the algorithm given different optimization methods, batch sizes and amount of data using these metrics. Figure 3 shows the change in the Jacobian errors and the validation errors with respect to the number of data points. We saved the last 500 steps of data for validation, so at most 3500 data points were used for system identification. Results using the analytical method and the gradient ascent method with the batch sizes of 2, 10, 100, and 500 are reported. As shown in the figure, the validation error curves resemble the Jacobian error curves, validating it a good metric of model accuracy in the experiment. The stochastic descent algorithm works with a wide range of batch sizes all with similar validation errors, though too small a batch size underperforms compared with others. The analytical method does not work with fewer than 1500 data points because of the ill-posed matrix inversion in Eq. (46). However, it outperforms the gradient ascent once given enough data. The identification accuracy primarily depends on the number of data points used, no matter what optimization methods or batch sizes we apply. Figure 4 shows the simulated wavefront correction using the original biased model (computed using Fourier optics with no knowledge of the true aberrations), the true model (computed using Fourier optics with full knowledge of the true aberrations), and the best identified model (analytical solution using 3500 data points). EFC with a fixed regularization parameter and batch process estimation with two pairs of probing commands was used in this simulation. As can be seen, the identified model beats the biased model in both the wavefront control speed and the final contrast. The contrast gap between the true model and the biased model is significantly reduced after the E-M system identification.

Data collection
The same sampling policy was used in experiment as in simulation: we ran the wavefront correction to reach a relatively high contrast (settling at around 3 × 10 −6 ), applied 4000 random DM commands (between −0.6 and 0.6 V), and collected the resulting difference images, saving the last 500 steps as the validation set. Again, two pairs of DM probes were used for observation at each step.

Identification results
With the validation error proved to be a good metric, now we use this metric to evaluate the identifcation results with the experimental data. As shown in Fig. 5, the validation error curves of various cases decrease with the same trends as in Fig. 3(b), showing that the E-M algorithm also successfully detects and corrects the Jacobian errors in the experiment.
Further analysis of the sources of Jacobian errors in experiment can be found in Sec. 7. As shown by this regression analysis of the identified Jacobian, DM actuator's gain errors and pupil plane wavefront phase aberrations explain around half of the model errors in our experiment (Other errors may be the influence function shape errors, the wavefront aberrations on the plane of other devices and the system nonlinearities beyond the algorithm's identification ability.). Among these factors, the DM gain errors are easily corrected with only a few of data, while the wavefront aberrations are corrected slower and also varies over time.
We also compared the wavefront control results using the identified model and the original/biased physics model. In  the physics model, we had no knowledge of the wavefront aberrations and assumed the same gain and influence function shape for all the actuators. Similarly, EFC and batch process estimatiton were used in all the wavefront correction trials. Figures 6(a) and 6(b), respectively, show the wavefront control curves (contrast versus control iteration) using the analytical Jacobian solutions and the gradient ascent Jacobian solutions with different amount of data. In both cases, the wavefront corrections with the identified models are much faster than the biased physics model in the early stage; they all reached a contrast better than 3 × 10 −7 within only four to five control iterations. However, the analytical Jacobians did not perform better than the gradient ascent solutions as expected. After reaching a high contrast, the analytical Jacobians experienced some difficulties in correcting the small residual aberrations, resulting in a final contrast slightly worse than the physics model. We speculate that the analytical E-M solutions are overfitted to the data noise. By contrast, the gradient ascent solutions reached the same ultimate contrast as the physics model. This is mainly because the achievable final contrast in the lab is currently limited by the scattered, incoherent light. On conclusion from these results is that the gradient method is better for experimental applications. In addition, the wavefront correction speed did not improve much as the number of data points increased. This may be because the key factors that influence the wavefront correction speed, probably the DM actuator's gain errors, as discussed in the appendix, were detected and corrected with only tens of data points and/or offline system identification did not handle the time-varying data well.

E-M Algorithm Based Adaptive Control
The experimental results in Sec. 4 demonstrated the ability of the E-M algorithm to improve the Jacobian accuracy, even with only small amount of data. However, this system identification workflow (data collection-identification-wavefront correction) cannot keep up with some of the most important time-varying errors, such as thermally induced phase aberrations. In this section, we present an E-M algorithm based real-time adaptive control framework, or more specifically, a reinforcement learning control framework, to solve this problem. This reinforcement learning control strategy is not fundamentally different from the E-M algorithm based system identification; we use the same algorithm developed in Sec. 3 but only directly feed the wavefront correction data instead of the precollected data with random DM commands into the E-M equation.

Reinforcement Learning for FPWC
Reinforcement learning control has attracted much attention recently as an important branch of machine learning. In reinforcement learning, the system, or agent, alternately runs a control policy to explore the environment and an adaptation step that varies the policy based on the information from the control step. Since the agents directly learn from the control attempts, it is more efficient for them to find the best control policies and track the model variations in real time. This technique has been widely applied to training complex control systems, such as those playing the game of Go 31 or video games, 32 robot manipulation, motion planning, and locomotion. 33 Figure 7 shows the block diagram of the proposed adaptive FPWC system. It combines the wavefront estimation and control with the E-M system identification presented in Sec. 3. In this scheme, we no longer use random DM commands for identification. Instead, the DM commands and resulting images from the control loops are sent to the E-M algorithm to update the model instantanesouly. The new adaptive FPWC system now loops between running steps of wavefront estimation and control and updating the model parameters (which also means updating the control and estimation policy). In addition, not only is the Jacobian matrix, G, identified in the adaptive/ reinforcement learning control step, so too are the process noise, σ 2 , and observation noise, ν 2 , as demonstrated in Eqs. (49) and (50). These are then used to tune the wavefront estimator (the covariance matrices of process noises and observation noises in Kalman filter) and controller (Tikhonov regularization matrix in EFC) based on Eqs. (27), (32), and (33). As a result, the Kalman filter estimator better balances the weights of the model predictions and observations, and the controller better chooses the damping parameter in the wavefront correction. In our software implementation, we introduce a hyperparameter, γ, to Eq. (27), which defines a modified regularization matrix, W 0 ¼ γW ¼ 2γN pix σ 2 I, because we found the controller is usually able to be more aggressive than the theoretical suggestion.

Reinforcement Learning Simulation
Again using the imperfect lab conditions that result in phase aberrations and actuator gain biases, as stated in Sec. 4.1.1, we simulated the reinforcement learning control for 50 control iterations. Two pairs of probing images were collected at each iteration for wavefront estimation. In Sec. 4, we used same pairwise probes for the convenience of validation error calculation; however, here, we allowed the DM probes to vary among different control iterations in the reinforcement learning control simulation. After every 10 control iterations, we supplied the control commands (10 steps), the pairwise probes (2 pairs/ step × 10 steps), and the camera images (2 images/pair × 2 pairs/step × 10 steps) to the E-M algorithm to update the Jacobian matrix and the tuning parameters in the estimator and controller. For comparison, the wavefront control with the true Jacobian model and the fixed biased Jacobian model was also simulated. In both of these benchmark cases, the Kalman filter and the EFC controller were tuned to the best manually. Figure 8 shows the results of the three simulations. As can be seen, the reinforcement learning control gradually closed the contrast gap between the biased model and the true model. The E-M adaptation at every 10 iterations can be clearly seen on the correction curves.

Reinforcement Learning Experiment in HCIL
In this section, we present the results of using the reinforcement learning adaptive control approach in the HCIL. Unfortunately, because the ultimate contrast achievable in the HCIL is limited to roughly 1.5 × 10 −7 due to incoherent background light (as seen in Figs. 6), it is not possible to reproduce the simulation results from the previous section. There, the adaptation step was run after each 10 iterations of the control. But as can be seen in Fig. 8, the modeled system reaches a contrast better than the lab limit of 10 −7 in fewer than 10 steps, before the first reinforcement learning step. Through trial and error, it was found that the E-M algorithm cannot robustly identify the system with fewer than 10 learning steps. Therefore, to experimentally verify the algorithm, we limited each FPWC run to 10 control iterations and updated the model parameters using the E-M algorithm after each trial. The Jacobian and tuning parameters were then used for the next trial of wavefront correction. As shown in Fig. 9(a), the rate of convergence of the wavefront correction became faster after each learning trial. Note that we ran the E-M identification after every learning iteration, however, to keep the figure clean, we only report a few of the typical results (wavefront control with the initial biased model and after 1, 5, 10, 15, 19 learning trials). After only 19 learning trials, the FPWC system was able to reach 1 × 10 −6 in one control step and below 2 × 10 −7 contrast in three control steps, which is faster than the results from the offline system identification. This indicates the wavefront control provided more informative data compared with random DM commands. One possible explanation is that the controller in wavefront correction more frequently moves the DM actuators not blocked by the coronagraph masks, and the parameters of these actuators (corresponding columns of the Jacobian matrix) are actually the key parts to improve the wavefront correction. As a contrast, the random command policy indistinguishably moves all the actuators, which may not be efficient. The reinforcement learning framework may also have captured some time-varying errors. However, since our testbed is pretty stable over short time intervals, this should not be the main reason that the reinforcement learning control outperformed the system identification. Figures 9(b)-9(d) show the changes in the estimates of process noise and observation noise covariances and their ratio at each learning trial. As shown in these figures, we underestimated the noise levels at the beginning. The adaptive controller quickly corrected these incorrect assumptions. Then, the adaptive controller gradually corrected the errors in the Jacobian matrix, so that the process and observation noise covariance estimates decreased with additional learning trials. More details about the adaptive control experiment can be seen in the video in Fig. 10.
By using this reinforcement learning approach, much effort is saved, and accuracy gained, by not having to take testbed layout measurements, perform phase retrieval and surface characterization, or having to manually tune the controller and estimator parameters. The reinforcement learning adaptive control  results also shows promise for enabling self-maintenance of the FPWC during the mission.

Conclusion and Future Work
Efficient and successful focal plane wavefront control and estimation in coronagraph instruments requires accurate modeling of the optical system. In this work, we first proposed an E-M algorithm to identify the optical system as a linear statespace model. According to the simulation and experimental results in the Princeton's HCIL, the algorithm successfully corrects model errors such as those produced from errors in the DM gains and initial phase aberrations. Use of the identified models significantly increases the rate at which the wavefront correction converges. We also developed a model-based adaptive/reinforcement learning control scheme based on this E-M algorithm. The adaptive controller alternates between the wavefront correction and the model parameter self-adaptation, which significantly improves the performance of both the estimator and controller and requires only tens of learning iterations. This approach is very promising for the automatic maintenance of the FPWC system in future space missions. Future work will focus on generalizing this framework with more realistic assumptions. First, we plan to identify the full matrix regularization suggested in Sec. 2.4 instead of the scalar regularization. This will help us understand the interactuator couplings that are neglected by EFC and SM, as well as improve the performance of the wavefront correction. Second, we also plan to drop the linearity assumption and use EKF and neural networks to approximate the optical system as a nonlinear system. The linear assumption does not hold when we need large DM surface chages to correct the influences from telesocpe struts and/or segmented apertures. By introducing system nonlinearities back into the model, we should be able to further increase the speed and efficiency of the wavefront corrections, gain a deeper contrast, and better extract the exoplanet signal.

Appendix A: Regression Analysis of the Sources of Jacobian Errors
The Fourier optics analysis in Eq. (5) shows that the Jacobian errors primarily come from errors in the pupil field, E ab , and the influence functions, f 1∶N act . Thus, we can analyze the sources of the Jacobian errors by fitting E ab and f 1∶N act to our identified Jacobian matrix, G EM . After rearranging the real-valued Jacobian matrix, G EM , back into the complex form, F EM , based on Eq. (10), the fitting problem can be formulated as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 5 5 ; 3 2 6 ; 7 1 9 min E ab ;f 1∶N act kFðE ab ; f 1∶N act Þ − F EM k 2 F : (55) The pupil electric field and influence functions are, respectively, parameterized as follows: E Q -T A R G E T ; t e m p : i n t r a l i n k -; e 0 5 6 ; 3 2 6 ; 6 5 8 where Z m and β m are the Zernike polynomials and their coefficients, f is the shape of the influence function, and ρ q are the actuator gains. For simplicity, this parameterization neglects amplitude wavefront aberrations and the difference of influence function shapes among actuators. With this parameterization and Taylor expansion in Eq. (56), the fitting problem in Eq. (55) becomes a simple linear, least-square regression in the parameters β m and p q . Figure 11(a) compares the validation errors of an identified model (gradient ascent solution with batch size of 500 in Sec. 4.2) and its fitted model. The validation errors from only fitting with the DM gains or Zernike phase aberrations are also reported. As shown, the fitted model explains more than half the model errors identified by the E-M algorithm, which in part proves our guess about the major sources of model errors. More interestingly, the DM gains are accurately characterized with only the first 500 data points, so the corresponding validation error curve (red) decreases rapidly at the beginning, but changes little as the amount of data increases. By contrast, the validation error from the phase aberrations regression (blue) keeps decreasing as the amount of data increases without reaching plateu. This indicates that the phase aberrations are hard to to correct and may be slowly changing while collecting the data, so the identification algorithm keeps adjusting the Zernike coefficients as the data amount increases. Actually, the curve slope becomes even sharper in the end, because the data in the end may have more similar pupil aberrations as the validation data. The first five fitted Zernike coefficients with respect to the number of data points are further reported in Fig. 11(b). The defocus and vertical astigmatism do not change much, while the tip, tilt, and oblique astigmatism vary over time, which satisfies our observation that the center of the PSF shifted for one pixel horizontally and vertically, respectively, in our experiment after collecting 4000 data points. This explains why the marginal benefit of data decreases. Moreover, this also justifies the advantage of adapting the system in real time.