Optical networks-on-chip (ONoCs) will play an important role in optical interconnects for next-generation chip multiprocessors (CMPs). Recent advances in silicon integrated photonics make it viable to develop ONoCs using the
standard CMOS process. This paper introduces our work on cascaded-multiring-based tunable filters and ring-based integrated switchable wavelength routers.
Auto white balance (AWB) is an important technique for digital cameras. The human visual system is able
to recognize the original color of an object in a scene illuminated by a light source whose color
temperature differs from that of D65, the standard daylight illuminant. Recorded images and video clips, however, capture only the
light actually incident on the sensor, so they appear different from the real scene
as observed by a human. Auto white balance is a technique to solve this problem. Traditional methods such as
the gray world assumption and white point estimation may fail for scenes with large color patches. In this paper, an
AWB method based on color temperature estimation clustering is presented and discussed. First, the method
defines a list of lighting conditions common in daily life, each represented by its color
temperature together with a threshold used to decide whether a light source belongs to that kind of
illumination. Second, the image to be white balanced is divided into N blocks (N is determined empirically).
For each block, the gray world assumption method is used to calculate the color cast, which can be used to
estimate the color temperature of that block. Third, each calculated color temperature is compared with
the color temperatures in the given illumination list. If the color temperature of a block is not within any of
the thresholds in the given list, that block is discarded. Fourth, a majority vote is taken over the remaining
blocks: the color temperature matched by the most blocks is taken as the color temperature of the light
source. Experimental results show that the proposed method works well for most commonly used light sources.
The color casts are removed and the final images look natural.
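The four steps of the AWB abstract above can be sketched in Python as follows. This is only an illustration under stated assumptions: the illuminant list and its thresholds are made-up values, pixels are assumed to be linear sRGB triples, and McCamy's approximation is swapped in to map a block's mean color to a correlated color temperature, since the abstract does not say how that mapping is done. All function names are hypothetical.

```python
from collections import Counter

# Hypothetical illuminant list: (name, nominal color temperature in K, threshold in K).
ILLUMINANTS = [
    ("incandescent", 2850.0, 400.0),
    ("fluorescent", 4200.0, 500.0),
    ("daylight", 6500.0, 800.0),
]


def block_cct(pixels):
    """Gray-world estimate for one block: average the pixels, then convert the
    mean color to a correlated color temperature (McCamy's approximation)."""
    n = len(pixels)
    r, g, b = (sum(p[c] for p in pixels) / n for c in range(3))
    # Linear sRGB -> CIE XYZ (assumes the input really is linear sRGB).
    big_x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    big_y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    big_z = 0.0193 * r + 0.1192 * g + 0.9505 * b
    total = big_x + big_y + big_z
    if total == 0:
        return None  # black block carries no color information
    cx, cy = big_x / total, big_y / total
    if cy == 0.1858:
        return None  # McCamy's formula is undefined here
    m = (cx - 0.3320) / (0.1858 - cy)
    return 449.0 * m**3 + 3525.0 * m**2 + 6823.3 * m + 5520.33


def classify(cct):
    """Return the listed illuminant whose threshold contains cct, else None."""
    if cct is None:
        return None
    for name, nominal, threshold in ILLUMINANTS:
        if abs(cct - nominal) <= threshold:
            return name
    return None


def split_blocks(image, rows, cols):
    """Split an image (a list of pixel rows) into rows x cols blocks."""
    h, w = len(image), len(image[0])
    blocks = []
    for i in range(rows):
        for j in range(cols):
            blocks.append([
                image[y][x]
                for y in range(i * h // rows, (i + 1) * h // rows)
                for x in range(j * w // cols, (j + 1) * w // cols)
            ])
    return [blk for blk in blocks if blk]


def estimate_illuminant(image, rows=2, cols=2):
    """Discard blocks outside every threshold, then take a majority vote."""
    labels = (classify(block_cct(blk)) for blk in split_blocks(image, rows, cols))
    votes = Counter(label for label in labels if label is not None)
    return votes.most_common(1)[0][0] if votes else None
```

With these illustrative thresholds, a block filled with saturated blue maps to a temperature outside every listed range and is dropped, so a large blue patch does not outvote the neutral blocks. This is the failure case of a global gray world estimate that the block-wise vote is meant to avoid.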
Systems-on-chip provide single-chip solutions for many embedded applications, meeting their size and power requirements. Media processing, such as real-time compression and decompression of video signals, is now expected to be the driving force in the evolution of media processors. The MediaSoC322xA consists of two fully programmable processor cores and an integrated digital video encoder. Each programmable core targets a particular class of algorithms: the MediaDSP3200 handles RISC/DSP-oriented functions and multimedia processing, and the RISC3200 handles bit-stream processing and control functions. Dedicated interface units for DRAM, SDRAM, Flash, SRAM, on-screen display, and the digital video encoder are connected to the processor cores via a 32-bit system bus. The MediaSoC322xA is fabricated in a 0.18 μm six-layer-metal (6LM) standard-cell SMIC CMOS technology, occupies about 20 mm², and operates at 180 MHz. The MediaSoC322xA is used as an audio/video decoder in embedded multimedia applications.
To accelerate media processing, many media enhancement instructions have been added to the instruction sets of embedded processors. In this paper, a novel method called interaction between instructions and algorithms (IIA) is proposed to optimize these media enhancement instructions. Based on an analysis of the inherent characteristics of video processing algorithms and of the processor's architecture, three measures are proposed: three single-cycle bit-level manipulation instructions are implemented to speed up variable-length decoding (VLD); a dedicated data path, rather than software, resolves data misalignment in SIMD processing; and a memory architecture is proposed to support 128-bit word parallel processing. These measures are applied to the optimization of an embedded processor, the MediaDSP3200, which thoroughly fuses the RISC architecture with DSP computation capability and implements a reduced instruction set and a 64-bit SIMD instruction set with various addressing modes in a unified RISC pipeline architecture. Simulation results show that this optimization method reduces clock cycles in real-time processing by more than 26.4% for VLD, 47.8% for IDCT, and 66.8% for MC.
This paper describes an embedded single media processor core named MediaDSP3200, fabricated in a six-layer-metal 0.18 μm CMOS process, which implements a RISC instruction set, a DSP data processing instruction set, and a single-instruction-multiple-data (SIMD) multimedia-enhanced instruction set. MediaDSP3200 thoroughly fuses the RISC architecture with DSP computation capability, providing fundamental RISC, extended DSP, and SIMD instruction sets with various addressing modes in a unified pipeline architecture. These characteristics greatly enhance the system's digital signal processing performance. The test processor achieves 320 MOPS for 32x32-bit multiply-accumulate (MAC) operations and 1280 MOPS for 16x16-bit MAC operations. It dissipates 600 mW at 1.8 V and 320 MHz. The implementation is primarily in a standard-cell logic design style. MediaDSP3200 targets diverse embedded applications that need both powerful processing and control capability and a low cost, e.g. set-top boxes, video conferencing, and DTV. The paper describes the MediaDSP3200 instruction set architecture, addressing modes, pipeline design, SIMD features, split ALU, and MAC unit. Finally, performance benchmarks based on H.264 and MPEG decoder algorithms are given.
Thin films are very important in many industries. To perform the functions for which they were designed, films must have the proper thickness, roughness, and other characteristics. These characteristics must often be measured both during and after fabrication. Optical methods for determining film characteristics are usually preferred because they are accurate, nondestructive, and require little or no sample preparation. This paper introduces a new method of determining the thickness of thin films using the Adaptive Simulated Annealing (ASA) algorithm. Based on thin-film optical theory, the method uses spectral reflectance data, measured over a range of wavelengths with the incident light perpendicular to the sample surface, to calculate the film thickness. ASA is selected as the global optimization algorithm because it handles multimodal and nonsmooth cost functions well and converges quickly and accurately. The thicknesses of four thin-film systems are calculated to verify the correctness and efficiency of the method, and the results are satisfactory.
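The fitting idea in the thin-film abstract above can be illustrated with a small Python sketch. It makes several assumptions that are not in the abstract: a single non-absorbing film on a substrate, constant refractive indices (the values 1.46 and 3.88 are illustrative only), a sum-of-squares cost, and a plain simulated annealing loop with linear cooling swapped in for ASA. The function names and the search range are hypothetical.

```python
import cmath
import math
import random


def reflectance(d_nm, wavelength_nm, n_film=1.46, n_sub=3.88, n_amb=1.0):
    """Normal-incidence reflectance of one non-absorbing film on a substrate."""
    r01 = (n_amb - n_film) / (n_amb + n_film)  # ambient/film Fresnel coefficient
    r12 = (n_film - n_sub) / (n_film + n_sub)  # film/substrate Fresnel coefficient
    beta = 2.0 * math.pi * n_film * d_nm / wavelength_nm  # phase thickness
    phase = cmath.exp(-2j * beta)
    r = (r01 + r12 * phase) / (1.0 + r01 * r12 * phase)
    return abs(r) ** 2


def cost(d_nm, wavelengths, measured):
    """Sum of squared differences between modeled and measured reflectance."""
    return sum((reflectance(d_nm, w) - m) ** 2 for w, m in zip(wavelengths, measured))


def anneal_thickness(wavelengths, measured, lo=50.0, hi=1000.0, steps=4000, seed=0):
    """Plain simulated annealing over thickness; returns (best_d, best_cost)."""
    rng = random.Random(seed)
    x = rng.uniform(lo, hi)
    fx = cost(x, wavelengths, measured)
    best, fbest = x, fx
    for k in range(steps):
        temp = 1.0 - k / steps  # linear cooling, stays strictly positive
        cand = x + rng.gauss(0.0, temp * (hi - lo) / 4.0)
        cand = min(hi, max(lo, cand))  # keep the candidate inside the bounds
        fc = cost(cand, wavelengths, measured)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if fc < fx or rng.random() < math.exp((fx - fc) / (0.05 * temp)):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x, fx
    return best, fbest


if __name__ == "__main__":
    wl = list(range(400, 801, 10))
    synthetic = [reflectance(500.0, w) for w in wl]  # stand-in for measured data
    print(anneal_thickness(wl, synthetic))
```

The reflectance spectrum is periodic in the phase thickness, so the cost has many local minima in thickness. That is why a global optimizer is used here rather than a plain gradient descent.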
Media processing, such as real-time compression and decompression of video signals, is now expected to be the driving force in the evolution of media processors. In this paper, a hardware/software co-design approach is introduced for a 32-bit media processor, MediaDsp3201 (MD32 for short), which is implemented in a 0.18 μm TSMC process, runs at 200 MHz, and achieves 200 million multiply-accumulate (MAC) operations per second. In our design, RISC and DSP are merged into one processor (RISC/DSP). Based on an analysis of the inherent characteristics of video processing algorithms, media enhancement instructions are added to MD32's instruction set. These media extension instructions are physically realized in the processor core and improve video processing performance effectively at negligible additional hardware cost (2.7%). Because the operations performed by media instructions are highly complex, a technique named scalable super pipeline is used to resolve the timing delay of the pipeline stages (mainly the EX stage). Simulation results show that our method reduces the instruction count for IDCT by more than 31% and 23% compared with the MMX and SSE implementations, respectively, and by 40% for MC compared with the MMX implementation.
This paper proposes a design method for the pipeline and bypassing unit (BPU) of our 32-bit RISC/DSP processor, MediaDsp3201 (MD32 for short). MD32 is implemented in 0.18 μm technology, operates at 1.8 V with a 200 MHz clock, and achieves 200 million multiply-accumulate (MAC) operations per second. It thoroughly merges the RISC architecture with DSP computation capability, providing fundamental RISC, extended DSP, and single-instruction-multiple-data (SIMD) instruction sets with various addressing modes in a unified, customized DSP pipeline architecture. We first describe the pipeline structure of MD32 and compare it with a typical RISC-style pipeline. We then study two bypassing schemes, centralized and distributed BPU design (CBPU and DBPU), in terms of their effectiveness in resolving pipeline data hazards. A bypassing circuit chain model is given for DBPU, in which register read is placed only at the ID pipeline stage. Because the processor's clock rate is determined by the pipeline delay and the BPU contains a long serial path of combinational logic, the optimization of the serial priority-select circuit is also analyzed in detail. Finally, the performance improvement is analyzed.
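As a behavioral illustration of the serial priority selection mentioned in the BPU abstract above, the following Python sketch resolves a source operand read in the ID stage against results still in flight. It is a generic forwarding model, not the MD32 circuit: the `InFlight` record, the youngest-first ordering, and the stall rule are assumptions made for the example.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class InFlight(NamedTuple):
    """Result of an instruction that has not yet written the register file."""
    dest: Optional[int]   # destination register number, or None if no write
    value: Optional[int]  # computed result, or None if not available yet


def read_operand(reg: int, regfile: Sequence[int],
                 in_flight: Sequence[InFlight]) -> Tuple[Optional[int], bool]:
    """Return (value, stall) for a source register read in the ID stage.

    `in_flight` is ordered youngest first, so the first match has priority,
    mirroring a serial priority-select chain.
    """
    for producer in in_flight:
        if producer.dest == reg:
            if producer.value is None:
                return None, True  # producer not finished yet: stall
            return producer.value, False  # bypass the youngest matching result
    return regfile[reg], False  # no hazard: use the register file
```

The loop makes the cost of the serial chain visible: each extra in-flight stage adds one more compare-and-select step on the path. This is the kind of long combinational path the abstract says must be optimized.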
RISC and DSP, the two main architectures, each have their own features. The main idea of RISC is “simple is fast”. Acting as a controller, RISC is based on a load/store structure, a register-register instruction set architecture (ISA), general-purpose registers, and a cache. DSP, on the other hand, is designed for signal processing and emphasizes high-volume data access and fast computation. It is based on a register-memory ISA, diverse addressing modes, a data address generator, a multiplier-accumulator, and RAM. As embedded systems grow rapidly, no single core architecture, neither RISC nor DSP, can meet their needs anymore, so combination is necessary. There are two kinds of combination: dual-core and single-core. Single-core means that the RISC core and the DSP core are merged into one core with common resources and a unified ISA. A 32-bit media processor named MediaDSP3201 (MD32 for short) is a new member of this family. This paper introduces the MD32 design, concentrating on the ISA and the pipeline, which are important parts of the architecture design. Compatibility runs through the whole design: the ISA should include features from both RISC and DSP ISAs, and the pipeline should fit the designed ISA as well as possible. MD32 was fabricated by TSMC on the first attempt in spring 2004. Application programs running on it show that the design is successful and that the chip is suitable for embedded system applications.
Turbo codes are now widely recognized as one of the most effective techniques for achieving performance very close to the Shannon limit in many transmission systems. This paper presents the design of a speed-optimized ASIC turbo decoder core. The proposed architecture reduces complexity. Because the decoding algorithm is recursive, the result of each recursion step is needed immediately in the following cycle. To mitigate this effect, a pipeline that balances the critical path is adopted. The core is suitable not only for FPGA implementation but can also be embedded into DSPs, and the decoding rate reaches 6 Mbps in 0.18 μm technology.
Proc. SPIE. 5309, Embedded Processors for Multimedia and Communications
KEYWORDS: Human-machine interfaces, Digital signal processing, Wavelets, Field programmable gate arrays, Computer simulations, Telecommunications, Signal processing, Software development, Multimedia, Data communications
Hardware/software co-simulation is a key step in the hardware/software co-design flow. In this paper, a reconfigurable co-simulation platform for media processors, called MPSP, is described. Both the hardware and the software of the platform can be reconfigured quickly to accommodate different media processors and different simulation specifications. The co-simulation environment on MPSP is library-based: a reconfigurable IP library and a software package with an API are provided as part of MPSP. With this platform, FPGA-based co-simulation is greatly accelerated.
Under pressure from design productivity demands and a variety of special applications, traditional DSP design methods can no longer keep up with the required pace, and a new design method is urgently needed. Intellectual property (IP) reuse is a trend in DSP design, but simple plug-and-play approaches to IP cores almost never work. Appropriate control strategies are therefore needed to connect all the IP cores used and to coordinate the whole DSP. This paper presents a new DSP design procedure that draws on system-on-a-chip design, and then introduces a novel control strategy named DWC for implementing a DSP based on IP cores. The most important part of this control strategy, the pipeline control unit (PCU), is described in detail. Because a great number of data hazards occur in most computation-intensive scientific applications, a new and effective data hazard checking algorithm is employed in the PCU. Following this strategy, the design of a general-purpose or special-purpose DSP can be completed in a shorter time, and the DSP has the potential for improved performance with little modification to the basic function units. The DWC strategy has been implemented successfully in a 16-bit fixed-point DSP.
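The abstract above does not describe the PCU's hazard checking algorithm, so the following Python sketch shows only a generic scoreboard-style check, swapped in for illustration. A bitmask records registers with outstanding writes, and an instruction may issue only if none of its source or destination registers is pending. The class and method names are hypothetical.

```python
def reg_mask(regs):
    """Pack a list of register numbers into a bitmask."""
    mask = 0
    for r in regs:
        mask |= 1 << r
    return mask


class Scoreboard:
    """Tracks registers that have a write still in flight."""

    def __init__(self):
        self.pending = 0  # bit i set means register i has an outstanding write

    def can_issue(self, srcs, dsts):
        """True if there is no read-after-write or write-after-write hazard."""
        touched = reg_mask(srcs) | reg_mask(dsts)
        return (touched & self.pending) == 0

    def issue(self, dsts):
        """Mark destination registers as pending when an instruction issues."""
        self.pending |= reg_mask(dsts)

    def retire(self, dsts):
        """Clear destination registers once the write has completed."""
        self.pending &= ~reg_mask(dsts)
```

A bitmask keeps the check to a single AND per instruction, which maps naturally onto hardware. A real pipeline control unit would combine such a check with bypassing so that not every hazard costs a stall.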
KEYWORDS: Signal to noise ratio, Digital signal processing, Data compression, Clocks, Data storage, Field programmable gate arrays, Computer programming, Telecommunications, Signal processing, Wireless communications
Due to their near Shannon-capacity performance, turbo codes have received a considerable amount of attention since their introduction. They are particularly attractive for cellular communication systems and have been included in the specifications for both the WCDMA (UMTS) and cdma2000 third-generation cellular standards. The log-MAP decoding algorithm and several techniques for reducing its complexity have been discussed in the past. When turbo codes are applied to wireless communications, however, the decoding rate is the bottleneck. A software implementation is not realistic at the processing rates of today’s DSPs, so the decoder should be realized in hardware. The purpose of this paper is to present a full ASIC design approach for turbo decoding. Several techniques are added to the basic log-MAP algorithm to simplify the design and improve performance. In the log-MAP algorithm, the Jacobi logarithm is computed exactly using max*(x, y) = ln(exp(x) + exp(y)) = max(x, y) + fc(|y-x|). The correction function fc(|y-x|) is important because omitting it causes a 0.5 dB SNR loss. A linear approximation can be used, and in our design its parameters were selected carefully to suit hardware implementation. Quantization is important in ASIC design, both to save power and to preserve performance; we adopt a compromise scheme that saves power while retaining good BER behavior. Many noisy frames can be corrected with a few iterations, while some of the noisier frames need the full number of iterations (NOI). Thus, if the decoder could stop iterating as soon as the frame becomes correct, the average NOI would be reduced. There are many ways to stop the iteration, such as CRC checking and comparison; we adopt a stopping criterion that requires significantly less computation and much less storage. For long frames, the memory for storing the entire frame of the forward probability α or the backward probability β can be very large.
Available products all use the sliding-window version of the turbo decoder to reduce the memory requirements, and so does our design. In addition, a new method is adopted that extends the sliding window length without increasing the storage requirement. This method also improves performance noticeably.
The techniques adopted in this paper are well suited to hardware design for wireless applications. For example, the decoding core can be embedded into our 32-bit digital signal processor (MD-32) to realize a 3G base station receiver.
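The max* operation and its correction term from the turbo decoding abstract above can be sketched in Python as follows. The exact form follows the Jacobi logarithm identity, where fc(z) = ln(1 + exp(-z)). The linear approximation uses a slope of 0.25 and a cutoff of 2.5, which are illustrative, shift-friendly values chosen for this example and not the parameters selected in the paper.

```python
import math


def max_star_exact(x, y):
    """Jacobi logarithm: ln(exp(x) + exp(y)) = max(x, y) + ln(1 + exp(-|y - x|))."""
    return max(x, y) + math.log1p(math.exp(-abs(y - x)))


def max_star_linear(x, y, slope=0.25, cutoff=2.5):
    """max* with a piecewise-linear correction (illustrative parameters)."""
    correction = max(0.0, slope * (cutoff - abs(y - x)))
    return max(x, y) + correction


def max_log(x, y):
    """max-log-MAP: the correction term is dropped entirely."""
    return max(x, y)


if __name__ == "__main__":
    for z in (0.0, 1.0, 2.0, 3.0):
        print(z, max_star_exact(0.0, z), max_star_linear(0.0, z), max_log(0.0, z))
```

Writing the exact form with `max` plus a bounded correction avoids evaluating large exponentials. It also shows why a small lookup table or a linear segment can replace the logarithm in hardware.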
This paper presents a system-level hardware/software codesign methodology for an HDTV source decoder. The method meets the requirements of HDTV source decoding applications. Based on algorithm development and analysis, the MPEG-2 MP@HL decoding process is partitioned into software and hardware at the system level. We construct a system model that represents the system architecture and the application program of interest. A platform-based design environment is developed that provides modularity, scalability, and flexibility in cosimulation and development for systems-on-chip. Our main contribution is the introduction of platform-based codesign tools in which software and hardware are validated and developed concurrently at the cycle-accurate level.
This paper describes a hardware/software codesign method for the extensible embedded RISC core VIRGO, which is based on the MIPS-I instruction set architecture. VIRGO is described in the Verilog hardware description language, has a five-stage pipeline with a shared 32-bit cache/memory interface, and uses a distributed control scheme. Every pipeline stage has one small controller, which controls the status of that stage and its cooperation with the other stages. Because the description uses a high-level language and the structure is distributed, the VIRGO core is highly extensible and can be adapted to application requirements. Taking a high-definition television MPEG-2 MP@HL decoder chip as an example, we constructed a hardware/software codesign virtual prototyping machine that can be used to study the VIRGO core's instruction set architecture, the system-on-chip memory size requirements, the system-on-chip software, and so on. The virtual prototyping platform can also be used to evaluate the system-on-chip design and the RISC instruction set.
KEYWORDS: Digital signal processing, Lithium, Detection and tracking algorithms, Embedded systems, Nanoimprint lithography, Electronics engineering, Algorithm development, Cerium, Information science, Standards development
A C compiler is a basic tool for most embedded systems programmers. It is the tool by which the ideas and algorithms in your application (expressed as C source code) are transformed into machine code executable by the target processor. Our research was to develop an optimizing C compiler for a specific 16-bit DSP. Code generation is one of the most important parts of the C compiler, and its efficiency and performance directly affect the resulting target assembly code. Thus, to improve the performance of the C compiler, we constructed an efficient code generator based on RTL, an intermediate language used in GNU CC. The code generator accepts RTL as its main input, takes advantage of features specific to RTL and to the DSP's architecture, and generates compact assembly code for the DSP. In this paper, the features of RTL are first briefly introduced. Then the basic principle of constructing the code generator is presented in detail. Following this principle, the paper discusses the architecture of the code generator, including syntax tree construction and reconstruction, basic RTL instruction extraction, behavior description at the RTL level, and instruction description at the assembly level. The optimization strategies used in the code generator to produce compact assembly code are also given. Finally, we conclude that the C compiler using this code generator achieves the high efficiency we expected.
For the plankton recognition system, we propose a new-generation system based on a parallel high-performance DSP architecture, chosen because its algorithms are easy to modify and it can readily be developed into other application systems. We estimate the performance of low-level, intermediate-level, and high-level algorithms on the DSP. This paper focuses on the new architectural concepts of the plankton recognition system and on some algorithm optimizations for the TMS320C6201 DSP.
Proc. SPIE. 4314, Security and Watermarking of Multimedia Contents III
KEYWORDS: Digital signal processing, Safety, Computing systems, Field programmable gate arrays, Control systems, Data processing, Telecommunications, Data communications, Computer architecture, Computer security
There has been a vast increase in the accumulation and communication of digital computer data in both the private and public sectors. Much of this information has significant value and requires protection. Encryption is an effective measure for data security applications. With proper management controls, adequate implementation specifications, and applicable usage guidelines, data encryption not only aids in protecting data communication but can also provide protection for a myriad of specific data processing applications. DSPs are becoming more and more popular for their fast instruction execution, wide applicability, and relatively low cost, which makes a DSP a good choice of processing unit for implementing complex data encryption algorithms. We developed a DSP-based Data Encryption Communication System (D-DECS) to implement real-time data security applications. This paper not only presents a new architecture for building D-DECS, the Single-Program, Multiple-Data stream (SPMD) architecture, but also describes the complete hardware and software features of the D-DECS we developed. Finally, our results show that the D-DECS based on the SPMD architecture offers both good performance and the flexibility to meet various requirements in different situations.
The architecture of the processor has a great effect on the performance of the whole processor array. To improve the performance of the SIMD array architecture, we modified the structure of the BAP (bit-serial array processor) processing element, starting from the BAP128 processor. The resulting modified bit-serial array processor (MBAP) chip is designed in 0.35 μm CMOS technology for embedded image understanding systems. This paper presents the MBAP architecture and describes the architectural features of the design. The performance of BAP and MBAP is compared on basic macro instructions and low-level image understanding algorithms. The results show that MBAP performs much better than BAP, at the cost of a 5% increase in chip resources.