Convolution filtering is one of the most important algorithms in image processing. It is data-intensive, especially when dealing with high-definition images. Most previous studies on accelerating convolution computation in parallel focus on the use of graphics processing units (GPUs), whereas the central processing units (CPUs) always play the role of host to manage the data buffer and control flow. However, recent CPU architectures have seen significant modifications to parallel data computing capabilities, and the trend of integrating the CPU and GPU on a single chip is on a rise. We propose an approach to accelerate convolution filtering on the heterogeneous architecture of integrated CPU–GPU. We exploit the parallel processing power of vector instructions on a CPU and make it collaboratively function with the on-chip GPU. Two task assignment methods, static and dynamic task partitioning, are proposed for CPU–GPU collaboration. We evaluate our approach with images and filters of different sizes. The experimental results demonstrate that we can achieve 146 GFLOP/s at best using a quad-core CPU and the performance is 2.5 to 4.8 times faster than that of the single-GPU version of the OpenCV library. We also obtain 90 times speedup over the single-threaded CPU version. The results demonstrate that the proposed algorithm is efficient.