Tasks will be scheduled in blocks (work units) using a wavefront propagation strategy, which allows sparse scheduling. Because work units have been designed to be spatially cohesive, the fast Thread Group Shared Memory can be used and reused through a Gauss-Seidel-like acceleration. The work unit partitioning scheme will, however, alternate between odd- and even-numbered iterations to reduce convergence barriers. Synchronization will be ensured by an 8-step 3D variant of the traditional red-black ordering scheme. An attack model and early termination will also be described and implemented as additional acceleration techniques.
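To illustrate how a red-black ordering can extend to eight steps in 3D, the following minimal sketch assigns each work unit a phase from the parity of its block coordinates. This parity rule is an assumption made for illustration and is not necessarily the exact scheme used in this work; the names `phase_of`, `blocks_in_phase`, and `are_neighbours` are hypothetical.

```python
from itertools import product


def phase_of(bx, by, bz):
    """Return a phase in 0..7 built from one parity bit per axis.

    Assumption: the eight synchronization steps map to the eight
    parity combinations of the block coordinates.
    """
    return (bx & 1) | ((by & 1) << 1) | ((bz & 1) << 2)


def blocks_in_phase(phase, grid_dims):
    """List the block coordinates that would be processed in one phase."""
    nx, ny, nz = grid_dims
    return [
        (bx, by, bz)
        for bx, by, bz in product(range(nx), range(ny), range(nz))
        if phase_of(bx, by, bz) == phase
    ]


def are_neighbours(a, b):
    """True if two distinct blocks touch by a face, an edge, or a corner."""
    return a != b and all(abs(p - q) <= 1 for p, q in zip(a, b))
```

Under this assumption, two blocks in the same phase share the same parity on every axis, so they differ by at least two blocks along some axis and never touch. Each phase can therefore update its blocks concurrently without two neighbouring work units writing to a shared boundary at the same time.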
Using our hybrid framework and typical operating parameters, we were able to compute the superpixels of a high-resolution 512x512x512 aortic angioCT scan in 283 ms using an AMD R9 290X GPU. This is a 22.3X speed-up compared to the published reference GPU implementation.