Autonomous aeroamphibious invisibility cloak with stochastic-evolution learning

Abstract. Being invisible ad libitum has long captivated the popular imagination, particularly in terms of safeguarding modern high-end instruments from potential threats. Decades ago, the advent of metamaterials and transformation optics sparked considerable interest in invisibility cloaks, which have been mainly demonstrated in ground and waveguide modalities. However, an omnidirectional flying cloak has not been achieved, primarily due to the challenges associated with dynamic synthesis of metasurface dispersion. We demonstrate an autonomous aeroamphibious invisibility cloak that incorporates a suite of perception, decision, and execution modules, capable of maintaining invisibility amidst kaleidoscopic backgrounds and neutralizing external stimuli. The physical breakthrough lies in the spatiotemporal modulation imparted on tunable metasurfaces to sculpt the scattering field in both space and frequency domains. To intelligently control the spatiotemporal metasurfaces, we introduce a stochastic-evolution learning that automatically aligns with the optimal solution through maximum probabilistic inference. In a fully self-driving experiment, we implement this concept on an unmanned drone and showcase adaptive invisibility in three canonical landscapes—sea, land, and air—with a similarity rate of up to 95%. Our work extends the family of invisibility cloaks to flying modality and inspires other research on material discoveries and homeostatic meta-devices.

• Supplementary Note 3: Comparison between spatial-only and spatiotemporal metasurfaces • Supplementary Note 4: Broadband generalization of intelligent aeroamphibious cloak • Supplementary Note 5: Generation-elimination network • Supplementary Note 6: Supervised learning loss and the accuracies • Supplementary Note 7: Gyroscope detector • Supplementary Note 8: Camera and environment discrimination network (EDN) • Supplementary Note 9: Realization of an intelligent electromagnetic detector • Supplementary Note 10: Automatic control system of the spatiotemporal metasurfaces • Supplementary Note 11: Working flowchart of intelligent invisible drone • Supplementary Note 12: Experiment setup of invisible drone against amphibious background

Other supplementary information includes:
• Supplementary Videos 1 & 2 (.mp4 format).This video shows the experimental setup and results at different outdoor test sites.When the intelligent invisible drone freely flies in the sky and passes through a conical detection region, three receivers record the on-site scattering wave in real time.At the same time, we compare the performance of cloaked drone with bare drone.

Supplementary Note 1: Design of the meta-atom
To realize the spatiotemporal metasurface and obtain the phase state with uniform coverage of 2π, the first step is to design a high-performance programmable metasurface [S1-S3].The geometrical details of the spatiotemporal meta-atom are illustrated in Fig. S1 and comprise an irregular octagon metal patch and two metal strips patched on the dielectric substrate.The dielectric substrate used in the design is F4B with   = 2.65 and a thickness of 3 mm.The period  of the unit cell is 40 mm.
The bottom layer is also a copper patch, serving as the ground to realize the reflective metasurfaces.
Two PIN diodes (SMP1320-079L from Skyworks) are placed between the octagon patch and two strips, which act as biasing lines for each diode.A metal via hole with a radius of 0.5 mm is further employed on the octagon patch to connect the top layer with the ground.The commercial software CST Microwave Studio is applied to carry out full-wave simulations and investigate the electromagnetic response of the unit cell.In the simulations of the unit cell, periodic boundary conditions are applied along both x and y directions, and Floquet ports are used along the -z direction.A normally incident plane wave (with x-polarized electric field) is assumed to calculate the reflection coefficient of the unit cell under different PIN states.In the optimization stage, we investigate the biasing voltage of two PIN diodes with four states (on-on, on-off, off-on, and off-off) through adjusting  1 ,  2 and  3 with fixed width  1 ~  5 .We observe that four distinct reflection responses are achieved with about 90° phase difference at around 3.1 GHz, while the reflection amplitude remains almost unity when

Supplementary Note 2: Physical principle of the spatiotemporal metasurfaces
We consider microwave reconfigurable metasurfaces that incorporate two electronic PIN diodes.By applying different dc voltage, the reflection response of metasurfaces can be switched among four discrete states, (on, on), (on, off), (off, on), and (off, off).We optimize the geometries of metasurfaces to attain an interval of π/2 among four states while keeping the amplitude as high as possible.On this basis, we introduce time-varying modulation into metasurfaces to revamp reconfigurable metasurfaces into spatiotemporal metasurfaces [S4-S7].Here, we consider periodic time-varying series that consists of L segments, and the value of each segment is one of the four discrete reflection states.The reflection state keeps constant in each segment.Mathematically, such time-varying series can be written as, where  is the number of time-varying series, and   is the reflecting coefficient at th segment.
() is the gate function, expressed as, where  is the period of time-varying series,  is an integer, ranging from 1 to .Evidently,   () is a periodic square-wave signal, which is nonzero only in the  ℎ segment.According to Fourier theorem, () can be decomposed into a sum of a series of orthogonal complex exponential functions with different angular frequencies: where k is the order of the complex exponential term, ∆ = 1/ is the frequency of the first order complex exponential term,   is the Fourier coefficient of the k th complex exponential term.In frequency domain, each complex exponential function is associated with a single frequency harmonic.
The frequency of the k th harmonic is ∆.And the amplitudes and initial phases of the harmonic is expressed by the Fourier coefficient   .The Fourier coefficient   can be calculated as, From the above equation,   is determined by the time-varying series.In turn, by suitably designing the time-varying series, we can actively alter the magnitude and phase of a given harmonic wave.By changing the period of time-varying series, the frequency of each harmonic can be varied.We simplify Eq. (S4) as, Therefore, as indicated by Eq. ( S5), the time-varying reflection coefficients () can be decomposed into the sum of a collection of complex exponential items, each of which is linked with a harmonic.
To facilitate the understanding, we consider the magnitude and phase of   as the synthetic reflection coefficient for the k th harmonic wave.If  = 1, the spatiotemporal metasurfaces are degenerated into the basic spatial-modulated metasurfaces.This is consistent with Eq. ( S5) because   = 0 with  ≠ 0. If  = 1 and  = 0, Eq. ( S5) is simplified to  0 =  1 .

Figure S2 | Synthetic state with different initial state and different length of time-varying sequence.
With the increase of the length of time-varying sequence, the number of synthetic states increase, gradually occupying the entire complex plane.The results in the figure are all at the main frequency.
The distance between the synthetic state and the original point denotes the reflection amplitude, and the angle with the x axis denotes the reflection phase.
When  enlarges, the number of synthetic states increase.We note that a time-varying sequence can only induce a unique   , but a   can be induced by more-than-one time-varying sequences.
We plot the synthetic state with different  and , as shown in Fig. S2, where  is the number of initial states.When  = 2, we consider the amplitude of two initial states is unity, and the phase different is π. = 8 will produce 2 8 time-varying series but with only seven synthetic states (because many of them are overlapped) at the main frequency.Figure S2 exhibits a general trend that the number of synthetic states increase with the increase of  and .In our work, we consider  = 4 and  = 8, and the synthetic states almost occupy the entire complex plane, which provide a high degree of freedom to freely manipulate electromagnetic waves.

Supplementary Note 3: Comparison between spatial-only and spatiotemporal metasurfaces
To benchmark the superiority of spatiotemporal metasurfaces, we compare them with spatial-only metasurfaces for the same far-field customization task [S8].For a given far-field, we optimize the profile of spatial-only and spatiotemporal metasurfaces using genetic algorithm (GA).The flowchart of GA is illustrated in Fig. S4a.The initial population is decoded into a group of metasurfaces, and the corresponding scattering performances are simulated and evaluated by minimizing a cost function.
Then a genetic process (selective reproduction, crossing over, and mutation) is performed to update the individuals until an optimal coding matrix is found.In Fig. S4b, we randomly generate three farfield patterns and mimic them with spatial and spatiotemporal metasurfaces (8 × 8), respectively.For spatial metasurfaces, each meta-atom has four discrete states (Supplementary Note 1) that can be chosen.For spatiotemporal metasurfaces, the states of each meta-atom can be freely picked from the synthetic states and initial states (Fig. S3).Evidently, the far-field pattern enabled by spatiotemporal metasurfaces is highly consistent with the target (in both shape and value), in sharp comparison with spatial metasurfaces.It suggests that, by using spatiotemporal metasurfaces, a more powerful ability in manipulating electromagnetic waves will be reached.and spatial modulation to imitate them.We find that the spatiotemporal metasurfaces give a high fidelity not only in pattern shape but also in numerical value.
for intelligent algorithm, i.e., the generation-elimination network.We should add another frequency channel and then train the generation-elimination network in a similar manner.The difficulty is attributed to the metasurface physical performance and the modulation speed for spatiotemporal metasurfaces.In the following, we will specifically illustrate how to reach this goal.For example, under the illumination of incident wave with the bandwidth from  0 to  3 , we aim to completely suppress the scattering wave (Fig. S5a).For the incident frequency component  0 , it will generate a series of harmonic waves that affect the scattering wave at other frequencies.Similarly, for any incident frequency component from  0 to  1 , it will generate a series of harmonic waves.
Thus, for the scattering wave at  3 , it is contributed by the zero-order harmonics induced by incident wave at  1 , first-order harmonics induced by incident wave at  2 , second-order harmonics induced by incident wave at  1 , third-order harmonics induced by incident wave at  0 , labelled as 0,1,2,3 (Fig. S5b).Similarly, for the scattering wave at  0 , it is contributed by -3,-2,-1,0.If we can exactly engine the reflection amplitude of these orders become zero, then the scattering field will be zero.Typically, the lower the order, the greater the reflection amplitude.However, we find it is possible to make these low-order harmonics become zero, which relies on the optimization of time-varying sequence and the physical properties of the meta-atom.Broadband and reconfigurable meta-atoms have been widely studied [S4].For example, the double-layered meta-atom in the inset of Fig. S5a that operations for mean (" ") and standard deviation units (" ") followed by the Gaussian sampling ("Sampled").Both "FC" and "fc" refers to the fully-connected layer; "relu" refers to the ReLU activation function; "linear" refers to the linear activation function; "sigmoid" refers to the sigmoid activation function; "T_FC" is the abbreviation of transposed fully-connected layer and "F_FC" is the abbreviation of the fully-connected layer in the forward network.
The second step.The whole CVAE-based [S9] generation-elimination network participates in this core step.The parameters of the pre-trained forward network are fixed when training the entire network.
As shown in Fig. S6, the recognition module is composed of three fully-connected layers, encoding the concatenation of the Input and Label into lower dimensions that are used to input into the latent space.The reconstruction module is composed of a concatenation operation and four transposed fully-connected layers, decoding the sampled latent variables into 20-dimensional design parameters.

The third step.
There is no further training in the last inference step, and only the decoder and the forward network are involved.The rounding operation is carried out in this step.The sampled variables from the standard Gaussian distributions combined with the Label (i.e., the desired radar cross section, RCS) will be firstly decoded into countless candidates (i.e., 20-dimensional design parameters) [S10-S12].Then, the design parameters will be rounded and transformed into 181dimensional RCS value through the forward network.The best candidate is selected by finding the minimal RCS deviation with the Label.Here, we define the deviation as the Manhattan distance.
Other metrics such as Euclidean distance and correlation distance could be adopted to screen out the corresponding best candidate or increase the loss on the main lobe if only focusing on that.
As elucidated in the third step and depicted in Fig. 3a, the generated candidates should be rounded before being sent into the forward network for further elimination.The 20-dimensional candidate is composed of 10 groups of [amplitude, phase].The key point is that, because of the limited choices in the spatiotemporal modulation of metasurfaces, each of [amplitude, phase] group needs to be approximated to one of the 81 points (synthetic reflection states in Fig. 2c of the main text).Certainly, we should first convert [amplitude, phase] groups into the [real, imaginary] coordinate before the where the first term is KL divergence loss (evaluating the similarity between the approximate variational posterior and the prior probability), the second term is the reconstruction loss of  (calculated as the negative maximum likelihood) and the third term is the prediction loss of  (calculated as the MSE over the point).Notice that  is the 20-dimensional input,  is the 181dimensional label variable and  is the latent variable.The deterministic RCS value of ′ is predicted from the forward network based on the predictive distribution   (|) .The hyperparameter  is set as 2,000.To quantity the performance of our network, we systematically define two criteria.(1) MSE, the mean square loss between the predicted RCS and ground truth RCS. ( 2 where ℯ  is defined as the average error between the predicted RCS and ground truth RCS, that is, , where   (  ′ ) represents the th data point of the ground truth (predicted) RCS, and  is the number of RCS points.
Figure S9 displays the summary statistics of two quantitative criteria when trained with the forward network and generation-elimination network, separately.Without reducing learning rate (reduceLr) measure [S14], the accuracy can be as high as 99.15% for the forward network.For the generation-elimination network, the reduceLr measure does little help to the accuracy (from 97.67% to 97.68%), while increasing the dimension of the latent space improves the accuracy to some extent, from 96.86% of 2-dimension to 97.68% of 10-dimension.For comparison and reference, we also provide the accuracy of the predicted RCS that is selected without rounding operation on candidates, that is, the generated candidates are directly sent into the forward network for elimination.Further, we take other optimization measures such as batch normalization [S15] and dropout [S16], none of which can bring obvious improvement for the network performance.

Supplementary Note 7: Gyroscope detector
In our system, an attitude sensor (HWT905-232) is applied to real-time recognize the drone's gesture so that the invisible drone can customize scattering wave in specific direction (Fig. S10a).The attitude sensor is a high-performance 3D motion attitude measurement system based on micro-electromechanical system (MEMS) technology, including three-axis gyroscope, three-axis accelerometer, three-axis electronic compass and other motion sensors.By integrating various high-performance sensors and attitude dynamics algorithm engines, the drone can be provided with a three-axis attitude angle (, ,  in Fig. S10b) with high accuracy (the measurement accuracy is 0.05°), high- 1 = 21 mm,  2 = 35 mm and  3 = 17 mm.

For
the same configuration of time-varying series, the synthetic states for different harmonics are different.Illustrated in Fig. S3 are the results with  = 4 and  = 8.It generally shows that the reflection amplitude is holistically compressed and becomes smaller for high-order harmonics.Notice that different harmonics are not completely decoupled.

Figure S3 |
Figure S3 | Synthetic state in different harmonic waves.In this figure,  = 4 and  = 8.It is evident that the synthetic states gradually gather together.The amplitude of the synthetic state becomes smaller for high-order harmonic waves. 0 is the frequency of incident wave, i.e.,  = 0.

Figure S4 |
Figure S4 | Mimicking the far-field with spatial-only and spatiotemporal modulated metasurfaces.a, Flowchart of genetic algorithm.b, For the three randomly-given targets, we use spatiotemporal

Figure S5 |
Figure S5 | Broadband realization of intelligent aeroamphibious cloak.a, The spatiotemporal metasurfaces suppress the scattering wave in broadband.Inset shows a broadband meta-atom.b, Principle of reaching zero reflection in broadband.

Figure S6 |
Figure S6 | Structure and parameters of the generation-elimination network.The recognition module combined with the Input and Label composes the encoder, while the reconstruction module combined with the Label composes the decoder.The latent space includes fully-connected

Figure S8 |
Figure S8 | The loss of forward network and generation-elimination network over epochs.a, The training and validation losses of the forward network.To avoid overfitting, the network is earlystopped at 161 epochs with the patience set as 50 epochs.b, The training and validation loss of the generation-elimination network.Similarly, the network is early-stopped at 209 epochs with the patience set as 30 epochs.

Figure S9 |
Figure S9 | Two criteria to quantitatively evaluate the performance of networks upon different configurations."reduceLr" refers to the reduce learning rate measure."dim" is the abbreviation of dimension.
dynamics and real-time compensation.The attitude sensor is connected with a six-in-one serial port conversion module to achieve USB-232 digital interface conversion and data serial input/output, which is driven by the CP2102 driver.To intuitively show the working effect of the attitude sensor in the experiment, we use the 3D attitude model in the built-in upper computer to illustrate the realtime attitude of the drone.Postures of three random moments are captured in Fig.S10cwith environment, first, we use the CSI camera to achieve the transmission of rtsp stream and explore the dynamic acquisition of multiple streams.Then, we have the picture of the background environment in millisecond scale with the size of 1920 × 1080 × 3 , as shown in Fig.S11a, the background information of three random moments is recorded.Finally, the picture is resized into 32 × 32 × 3 and further input into the EDN for environmental discrimination, which is constructed as a classification network.The structure of EDN is presented in Fig.S11b, mainly containing the feature extraction module (2 convolutional and 2 max-pooling layers) and the classification module (3 full connection layers).The output of EDN is one of four backgrounds coded as 0, 1, 2, 3. Adam optimizer is employed to update the parameters to complete the training of the model[17][18].During training, the learning rate and batch size are set to 0.001 and 16, respectively.Forty environmental pictures are collected as training data, which can be divided into four types, including grass, cement, playground and water (Fig.S11c).The training result is shown in Fig.S11d, where the validation loss converges well with the training loss and the accuracy of EDN measured by the other 10 testing data is 100%.

Figure S11 |
Figure S11 | Principle and function of the attitude sensor.a, Diagram of the CSI camera and the realtime environment capture at three moments.b, Detailed structure of EDN, which mainly contains feature extraction and classification module.c, Environment database of EDN, including four types: grass ( 0 ), cement ( 1 ), playground ( 2 ) and water ( 3 ).d, The training result of EDN.

Figure S16 |
Figure S16 | Flow chart of the operation process of Jetson.As the core control board of the invisibility system, the whole operation process of Jetson can be divided into perception, decision, and execution modules, for self-adapting to the ever-changing background and detection manner.