The cudaMagW kernels are only partially optimized; at least a few register variables can still be eliminated. The Y and Z kernels need only 2 of the 3 components of dims, so dropping the unused component could eliminate one more register.
If a third tile of shared memory is affordable (check with NVIDIA's occupancy calculator spreadsheet), at least one __syncthreads() can be eliminated from the YZ kernels. Attempts to rewrite the algorithm to eliminate conditionals using exact math ops are encouraged: in IEEE 754, x+0 and x*1 evaluate exactly to x, and x-x evaluates exactly to 0 for finite x.
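As a rough illustration of the exact-math idea, the sketch below replaces two boundary conditionals with 0.0f/1.0f masks. It is not taken from the cudaMagW kernels: the function names and the neighbour-sum operation are hypothetical, and it is written as plain C++ so it can be checked on the host; the same expressions should carry over to a device function. It relies on x*1 and x+0 being exact, and it assumes all inputs are finite, since 0*inf would yield NaN.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference version: neighbours outside [0, n) are skipped
// with explicit conditionals.
inline float neighbour_sum_branchy(const std::vector<float>& v, std::size_t i) {
    float acc = 0.0f;
    if (i > 0)            acc += v[i - 1];
    if (i + 1 < v.size()) acc += v[i + 1];
    return acc;
}

// Hypothetical branch-free version: each neighbour is scaled by a mask that
// is exactly 0.0f or 1.0f. Multiplying by 1.0f and adding 0.0f are exact, so
// for finite inputs the result equals the branching version.
inline float neighbour_sum_masked(const std::vector<float>& v, std::size_t i) {
    const bool has_left  = i > 0;
    const bool has_right = i + 1 < v.size();

    // Clamp the indices so the loads stay in bounds even when masked out.
    const std::size_t left  = i - static_cast<std::size_t>(has_left);
    const std::size_t right = i + static_cast<std::size_t>(has_right);

    const float m_left  = static_cast<float>(has_left);   // 0.0f or 1.0f
    const float m_right = static_cast<float>(has_right);  // 0.0f or 1.0f

    return m_left * v[left] + m_right * v[right];
}
```

Calling both functions for every index of a non-empty vector of finite values should give equal results. Whether the masked form is actually faster depends on how the compiler predicates the original branches, so it is worth measuring before adopting it.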