In the medical ultrasound imaging, the synthetic transmit aperture (STA) technique is very promising and has been a hot research topic. It is dynamically focused in both transmit and receive yielding an improvement in resolution. But this imaging technique sets high demands on processing capabilities and makes implementation of a full STA system very challenging and costly. Many attempts have been made to reduce the demands on the system making it a more realistic task to implement. In this paper we don’t consider how to reduce the demands, but consider how to accelerate the processing speed of the system. The recent introduction of general-purpose graphic processing units (GPU) seems to be quite promising in this view, especially for the affordable programming complexity. In this paper we explain the main computational features of STA processing unit, trying to disclose the degree of parallelism in the operations. On the basis of the compute unified device architecture (CUDA) programming model and the extremely flexible structure of the Single Instruction Multiple Threads (SIMT) model, we show that the optimization of STA processing unit can be performed more efficiently. The input data is read from Matlab, the post-processing and display also use Matlab. Performance shows that, using a single NIVDIA GTX-650 GPU board, this amount to a speed up of more than a factor of 30 compared to a highly optimized beamformer running on our test workstation with a 3.20-GHz Intel Core-i5 processor.