While the resolution of the sensors in the remote sensing space missions tends to increase, the bandwidth of the communication channel remains comparatively stable. As a consequence, the reduction of the captured data volume on-board satellites by means of compression is becoming ever more important, and becomes almost imperative for multispectral and hyperspectral sensors, since the amount of data captured is substantial.
When selecting or designing an algorithm for on-board data compression, we have to consider both, its coding efficiency and if the algorithm is suitable to be efficiently executed on the hardware that is available on a satellite. Among the possible hardware technologies that can be used for on-board data compression, field-programmable gate arrays (FPGAs) stand out, yielding a good compromise between performance and cost, as they present multiple advantages, such as the ability to apply parallel processing to increase throughput, provide flexibility, scalability, and data integrity features. However, space-qualified FPGAs, which are tolerant to solar radiation, are more difficult to manufacture, increasing their cost and reducing their performance. As a consequence, the requirement of low complexity for an on-board compression algorithm becomes even more important.
The Consultative Committee for Space Data Systems (CCSDS) offers currently a solution for lossless multispectral and hyperspectral image compression (CCSDS 123).1 This solution has been proven to take a reduced amount of hardware resources,2 fitting in a small space-qualified FPGA.3 Nevertheless, although lossless compression has been traditionally preferred to preserve all the information in the image, the current bandwidth and on-board storage restrictions make it necessary in some scenarios to resort to lossy compression in order to reduce the data volume appropriately. As a consequence, the CCSDS is currently working toward the definition of a lossy standard as an extension of the existing CCSDS 122 standard for image compression.4 The latter defines a two-dimensional (2-D) image coder that combines a wavelet transform with a bitplane encoder and that, if coupled with a spectral transform, would take advantage of the spectral correlation present in multispectral and hyperpectral data.
Transform-based methods for compression are known to be well-suited for lossy compression. However, they are also known to be usually computationally complex in terms of the number of operations and memory requirements. In particular, in Ref. 5, several three-dimensional (3-D) transform coding techniques for lossy hyperspectral image compression are compared, showing increased rate-distortion performance when a spectral Kahrunen–Loève transform (KLT) is employed, especially in the low bit-rate region. However, the KLT presents important disadvantages when its implementation on hardware is considered: it has a high computational cost and has high memory requirements. All these factors lead to a higher implementation difficulty, and often become insuperable obstacles when considering an implementation on space-qualified FPGAs.
A divide and conquer strategy was proposed in Ref. 6, in which the pairwise orthogonal transform (POT) is presented. The POT is a simple adaptive transform that only requires a small amount of side information, which, in turn allows a line-based application of the transform not previously available to other divide-and-conquer approaches (or the KLT). In most cases, it has a better coding performance than wavelet transforms and, at the same time, it overcomes the inconveniences of the KLT, achieving a notably lower computational cost and better component scalability, with very low memory requirements (a few orders of magnitude less than a full KLT). The POT has a better performance than the wavelets, although it does not reach the coding performance of a full KLT.
This work aims at validating that the low complexity of the POT makes it suitable for a space-qualified FPGA implementation. A register transfer level (RTL) description of the main arithmetic stages of the POT, namely the mean subtraction, the training of balanced and unbalanced pairwise operations, and the lifting network are provided and optimized in order to of achieve a low occupancy of the hardware resources, making it possible to synthesize the design on a space-qualified RTAX2000S or on an RTAX2000S-DSP, which contains additional dedicated digital signal processor (DSP) blocks. In order to limit the dynamic range expansion of the transform, the variant of the POT introduced in Ref. 7 is the one being implemented.
Despite the POT being significantly less demanding in terms of hardware resources than the KLT, in this paper we make it even more amenable for an implementation on a space-qualified FPGA. In particular, a description of the mathematical operations in the transform is provided, in such a way that they can be solved with integer arithmetic, making the transform fully reversible and bit-exact (i.e., output values are unambiguous down to the bit level), which enables progressive lossy-to-lossless and makes implementations much easier to validate. Moreover, the most computationally demanding operations of the transform are solved by means of lookup tables (LUTs).
We note that, despite the POT being proposed to be included by the CCSDS in the upcoming standard for lossy multispectral and hyperspectral compression, the final standard may be different from the description of the operations in the POT presented in this paper. Moreover, the POT is only a part of the whole compression engine, which is being designed such that existing implementations of CCSDS-122.0, such as the ASICs described in Refs. 8 and 9, can be leveraged to create a complete compression engine.
This paper is structured as follows: Section 2 provides an overview of the arithmetic elements in the POT in a bit-accurate way, the subsequent Sec. 3 reports the hardware design considerations, explaining the methodology followed to create an RTL description of the latter operations, Sec. 4 shows the implementation results and finally, Sec. 5 presents the conclusions.
The Pairwise Orthogonal Transform: A Bit-Exact Definition
The POT is an approximation of the KLT at a fraction of its computational cost. It is a composition of three kinds of operations that, when combined into a multilevel structure, are employed to obtain the transformed image. Namely, these operations are the mean-subtraction operation, the balanced pairwise operation, and the unbalanced pairwise operation. For conciseness, this paper does not include a detailed description of the POT or the implemented variant; descriptions which are already available in Refs. 6 and 7.
In the following subsections we present a definition of the arithmetic operations of the transform, which makes it amenable for an implementation on the hardware available on a satellite. Specifically, the operations are described in a bit-exact way, utilizing only integer arithmetic. This will facilitate significantly the subsequent physical implementation of the POT. In this paper, the implementation target is an FPGA of the Microsemi RTAX family, which is currently space-qualified and is flown in current space missions. Reducing the complexity of the arithmetic operations becomes even more important due to the limited computational power of these devices, when compared with other nonqualified FPGA models that can be found nowadays in the market.
The mean-subtraction operation maps an input sequence of integer values of size to an output sequence , modifying the original values such that the output sequence has an arithmetic mean of approximately zero; see Fig. 1.
For a given input sequence, the value is an approximation of the arithmetic mean of , obtained as
Given , the result of the mean-subtraction operation is for .
The pairwise operation is defined in such a way that the output of the operation produces sequences that share virtually no correlation among each other.
Let and be two one-dimensional input sequences consisting of integer samples each, and with an arithmetic mean of approximately zero. Given these sequences, the pairwise operation produces two output sequences, and , consisting of samples each, and with an arithmetic mean also of approximately zero, as shown in Fig. 2
The pairwise operation changes in function of its current inputs. Two training parameters and are first derived from the inputs and then used to specialize the operation. These parameters are required to invert the operation and are stored as side information.
Additionally, the pairwise operation has a Boolean input, named flip, which changes the sign of all the elements of both output sequences.
The integer constant indicates the precision of pairwise operations. It must be set within the interval [9, 16] and it is constant for the whole compression process of an input image. This constant can be used to regulate the complexity of the implementation at the expense of minor coding performance penalties.
As a prerequisite to calculate the training parameters, and the scaled-up variances and covariance of and should be calculated as follows:
The training parameters and are then calculated as follows. When ,
Otherwise, when ,
After obtaining and , the decorrelated output sequences can be obtained through a lifting network. The lifting network, requires computing the weights , , and , and two flags: one to indicate whether the result shall be sign-changed (flag ) and the another to signal whether the outputs and shall be permuted. Hence, the result of the pairwise operation (the values and ) is produced by the lifting network shown in Fig. 3, where the values and are affected by three lifting steps (with weights , , and ), and the result is conditionally sign-changed and conditionally permuted.
Weights , , and are obtained from and as follows. Let
If , then
Output values shall be sign-changed when is . Output values shall be sign-changed when and is , or when and is 1.
Unbalanced Pairwise Operation
When the POT is applied to images with a number of spectral bands that is not a multiple of two, some pairwise operations need to be slightly modified, yielding the unbalanced pairwise operation described next (Fig. 4).
An unbalanced pairwise operation is a pairwise operation, as described previously, except for the following two variations.
The training parameters and are calculated as follows. When ,
Otherwise, when ,
The weights , , and are defined as follows: if , then
Hardware Implementation of the Arithmetic Elements of the Pairwise Orthogonal Transform
We present in the following sections the implementation of the operations in the POT in order to estimate its complexity in terms of utilization of hardware resources and assess the feasibility for an implementation on a space-qualified FPGA. More specifically, the following operations are described at RTL: mean-subtraction; computation of the variance and covariance; computation of the training parameters, and ; the computation of the weights; and the lifting network. A flowchart showing the relationship between the aforementioned operations in the POT is displayed in Fig. 5.
Main Design Considerations
The main objective of the presented hardware implementation is to optimize the amount of FPGA resources taken by the design; hence, we focus on reducing as much as possible the complexity of the operations, scheduling them sequentially when possible. For instance, the integer division operations which are needed to compute the mean values and to obtain the training parameter are solved by means of serial divisors in order to reduce the amount of hardware resources.
It is decided to split the design in several modules, which are described in VHDL. We designed a module that implements the mean subtraction, variance and covariance calculation, and the necessary operations to obtain the training parameters and , as it was described in Sec. 2. Once and are available, it is necessary to calculate the weights. From the equations presented in Sec. 2, we can infer that these operations are rather complex and might become computationally expensive for an FPGA. We have observed, however, that it is possible reduce the complexity and computational load of the overall design by utilizing a LUT that, given the parameters , , flip, and a flag that indicates is the transform is balanced or unbalanced, will produce the desired weights (, , and ); the sign change; and the flag indicating that the output shall be permuted. Finally, a module which implements the lifting network is also described in VHDL. The basic block diagram of the design is shown in Fig. 6.
In order to find out which is the best implementation solution for the LUT, from which the weights are obtained, we perform an estimation of the amount of resources needed to implement it only with combinatorial logic. The results are presented in Table 1, where it can be observed that the number of combinatorial cells needed to implement such LUT is highly dependant on . Given that the combinatorial cells required for such LUT rise considerably as grows—with, for example, a 14% of the RTAX2000S resources employed for —an alternative solution is proposed in order to reduce the area occupancy, which is to store the necessary LUT values in the internal RAM of the FPGA. The values can be stored in an external electrically erasable/programmable read-only memory (EEPROM), that would be loaded to RAM upon system start-up.
Combinatorial cells needed to implement the LUTs.
Register Transfer Level Description
We present the RTL description of the module that performs the POT. This module will receive the input sequences and produce the decorrelated sequences, as specified in Sec. 2.
The designed module, shown in Fig. 7, contains an input interface to a memory where the necessary samples are stored. A memory controller is implemented to access this memory. The decorrelated sequences (detail and principal) are obtained at the output, together with the necessary side information. A valid signal is included to validate the results. First, the input samples are subtracted their average and the variance and covariance values are calculated, and then the training parameters are obtained. From the training parameters and , it is possible to obtain the weights, using a LUT. As it was mentioned in Sec. 4, we assume the values in the LUT are in an external EEPROM, that is loaded to the FPGA internal RAM upon system start up. The internal RAM blocks are protected with an error detection and correction scheme able to correct single bit errors and detect double bit errors in the memory content. Finally, the decorrelated values are finally obtained from the weights using the lifting network module.
The following parameters are set as generics in the VHDL code, in such a way that it is possible to estimate how the occupancy of the design varies with them.
Additionally, we include a parameter to indicate if the number of input samples is a power-of-two. In the case the parameter is set, the implemented divisors, which are used to calculate the mean values, are avoided and the division is performed with bit shifts instead.
The synthesis of the previously described module has been performed on an RTAX2000S and an RTAX2000S-DSP, which are manufactured by Microsemi and are space-qualified. The main difference between them is that the RTAX2000S-DSP contains dedicated DSP blocks. Several different configurations of the top module are synthesized, varying the parameter that indicates if the number of input samples is a power-of-two (P2), varying the number of input samples (), or varying the number of bits per pixel (bpp). In all cases is set to 16. The different combinations of parameters can be observed in Fig. 8. We note that, even when is a power-of-two, if we do not inform explicitly the core with the corresponding parameter, the value will be interpreted as a nonpower-of-two value, i.e., the integer divisors will be used, they will not be replaced with bit shifts. This is the case of configurations C5 to C8.
Table 2 shows the hardware occupancy and maximum frequency of the implemented module obtained for the different configurations in both target FPGAs. The occupancy results are presented in terms of the percentage of combinatorial cells, sequential cells, and blocks of RAM (BRAM) utilized. For RTAX2000S-DSP device the percentage of used DSP blocks is also displayed.
Synthesis results on a RTAX2000S and on an RTAX2000-DSP for different configurations.
|COMB (%)||SEQ (%)||BRAM (%)||Frequency (MHz)||COMB (%)||SEQ (%)||BRAM (%)||DSP blocks (%)||Frequency (MHz)|
In order to facilitate the comparison of the presented results, the occupancy values are also displayed in the graphs of Figs. 9 and 10. We can observe that the complete module implementing the POT can be synthesized on an RTAX2000S for any of the presented configurations. The most demanded resource are the combinatorial cells in the RTAX2000S and the internal BRAM in the RTAX2000S-DSP. Comparing both FPGA models, we can see that the number of combinatorial cells occupied is reduced for the RTAX2000S-DSP, which has taken advantage of the DSP blocks that it contains to perform specific computations that, otherwise, would be performed with combinatorial logic. We can also observe that the occupancy of resources is significantly higher when the input samples are represented with 16 bpp. However, there is not a significant increase in the number of resources utilized when we change the number of input samples from 256 to 512.
The frequency results can be observed and compared in Fig. 11. It can be seen that the frequency increases significantly with the RTAX2000S-DSP. The maximum and minimum frequencies for the RTAX2000S are 37.58 and 29.92 MHz, respectively. For the RTAX2000S-DSP the maximum frequency is 73.50 MHz and the minimum frequency is 53.83 MHz. The frequency values decrease as the bit-width of the input samples increases. No significant differences in the frequencies are found when the number of samples increases.
The intent of this paper is to show the feasibility of an FPGA implementation of the POT, and as such the implementation has not been optimized for throughput. Anyhow, a brief discussion follows to illustrate on which could be the throughput of the whole compression engine. The proposed implementation is able to produce a transformed sample every four clock cycles, which yields a minimum throughput of for configuration C1 and a maximum of for configuration C8.
To complete the whole 3-D compression engine, a 2-D compressor implementing the CCSDS-122.0, such as the CCSDS wavelet image compression (CWICOM) ASIC, is utilized after the POT. The CWICOM is able to operate at up to .8 While the throughput of both elements is of similar magnitude, future optimizations of the POT implementation will be focused on improving its throughput, in such a way that it can yield one transformed sample per clock cycle.
This paper examines the hardware complexity of the arithmetic elements of the POT, namely the mean-subtraction operation, the training of the transform, and the lifting network. A bit-exact description of the operations is provided and utilized to create a hardware implementation with reduced complexity that only utilizes integer arithmetic. A LUT is used to eliminate the most computationally demanding operations of the POT. We perform the synthesis of the described hardware on two space-qualified FPGA, namely the RTAX2000S, and the RTAX2000S-DSP with a different number of input samples represented with different bit depths. We observed that for any of the studied configurations, the complete design fits in the target FPGAs. The worst-case in terms of occupancy shows a 74% of combinatorial cell usage in the RTAX2000S. The best-case for the same FPGA has a maximum occupancy of 24.13% of the combinatorial cells. Furthermore, we observed that the RTAX2000S-DSP offers the possibility of reducing substantially the amount of combinatorial cells used, by utilizing dedicated DSP blocks. This FPGA also yields higher frequency values, displaying a maximum frequency of 73 MHz.
These results show that the POT, which is only a part of a whole compression engine for multispectral and hyperspectral images, presents a low hardware complexity, making it a good candidate for an on-board hardware implementation.
This work has been partially supported by the Spanish Government, FEDER, and the Catalan Government under grants TIN2012-38102-C03-03 and 2014SGR-691. This work has been partially supported by Thales Alenia Space Espaa and by the Centre National d’Etudes Spatiales. An earlier version of this work has been published in the conference proceedings of the European Space Agency Centre National d'Etudes Spatiales On-Board Payload Data Compression Workshop 2014.
Lucana Santos received her PhD degree in 2014 from the University of Las Palmas de Gran Canaria (ULPGC). Currently, she is a postdoctoral researcher at the Integrated Systems Division of the Institute for Applied Microelectronics (IUMA). She was a visiting researcher at the European Space Research and Technology Center in 2011, where she did research on hardware architectures for hyperspectral image compression. From 2012 to 2014 she participated in an industrial project for Thales Alenia Space España.
Ian Blanes received his BS, MS, and PhD degrees in computer science from the Universitat Autònoma de Barcelona (UAB) in 2007, 2008, and 2010, respectively. Since 2003, he has been with the Group on Interactive Coding of Images of the UAB, where he currently holds a postdoctoral position. He was a visiting researcher at Centre National d’Etudes Spatiales. He was the second-place finisher as the best Spanish computer science student.
Aday García received his telecommunication engineer MSc degree from the ULPGC in 2013, where he has been working towards his PhD at the Integrated Systems Design Division of the IUMA. Since 2008, he has been working in the aerospace industry and has been involved in the development of on-board data processing boards for several European Space Agency programs, such as EUCLID and the Solar Orbiter and Meteosat Third Generation missions.
Joan Serra-Sagristà received his PhD degree in computer science from UAB, Spain, in 1999. From 1997 to 1998 he was at the University of Bonn, Germany. Currently, he is an associate professor at UAB. His research interests focus on source coding and data transmission. He serves as an associate editor for IEEE Transactions on Image Processing. He has coauthored over 125 publications. He was the recipient of the Spanish Intensification Young Investigator Award in 2006.
José López received his BS degree in physics from the University of Seville in 1989 and his PhD degree in telecommunication engineering from the ULPGC in 1994. He is a full professor in electronic technology and deputy director of the IUMA. In 1992, he was with Thomson-CSF, France. He was an invited researcher at the Technical University of Denmark (1995) and at Edith Cowan University, Australia (1997 and 2000).
Roberto Sarmiento received his PhD degree in electronic engineering from the ULPGC in 1991. He is a full professor at the Telecommunication Engineering Faculty at the ULPGC. He is a cofounder of the IUMA and director of the Integrated Systems Design Division of IUMA. Since 1990, he has published more than 50 journal papers and book chapters and more than 140 conference papers.