On Sat, Mar 3, 2012 at 6:57 PM, Abdulkadir Ahmed ahmed.abdulkadir@epfl.ch wrote:
Freesurfer has supported GPU acceleration since version 5.0. To assess the utility of this functionality, particularly the gain in performance, I ran `recon-all` with and without the option `-use-gpu`. I'd like to share the results and also hope to get some answers to the questions that came up. The video card used was a GeForce GTX 460 and the CPU was an Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz.
- Running mri_em_register_cuda was about 4 times faster than mri_em_register. During execution of mri_em_register_cuda, the GPU load was at 20%, as indicated by `nvidia-smi -a`.
If you look at the source, there are options in the mri_em_register source files which can use a much faster GPU algorithm. Unfortunately, this does not always converge to the same result (although I have no reason to think it's a 'worse' answer).
- Running `recon-all` with the `-use-gpu` option took 6:49 hours; without the option it took about 20 minutes longer. A summary of the processing steps taken from `recon-all-status.log` is attached to this email. Note that these times are wall clock times, not CPU times.
While Amdahl's Law is painful, that's a surprisingly small advantage :-(
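To put rough numbers on that, here is a minimal sketch of Amdahl's Law applied to the timings quoted above (6:49 with the GPU, about 20 minutes longer without). The 4x factor for the accelerated portion is only an assumption, borrowed from the mri_em_register measurement, and the function names are made up for illustration.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the runtime is accelerated by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)


def accelerated_fraction(overall, s):
    """Invert Amdahl's Law: the fraction of the original runtime that was accelerated."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / s)


gpu_minutes = 6 * 60 + 49        # wall clock time with -use-gpu
cpu_minutes = gpu_minutes + 20   # "about 20 minutes longer" without it
overall = cpu_minutes / gpu_minutes

# Assumption: everything that was accelerated ran about 4x faster.
p = accelerated_fraction(overall, 4.0)
print(f"overall speedup: {overall:.3f}")
print(f"implied accelerated fraction of the CPU-only run: {p:.1%}")
```

Under that assumption, only a small share of the CPU-only runtime sits in accelerated code, so even an arbitrarily fast GPU could not improve the total by more than 1/(1 - p).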
Hopefully this information will be helpful. The processes "EM Registration" and "CA Normalize" were significantly accelerated by the GPU, as expected. "SubCort Seg" also ran much faster. "Surf Reg rh" ran 3.4 times faster with GPU acceleration, while the accelerated version of "Surf Reg lh" was slower than the regular version. What binary was used in these steps? What could be the reason for the difference between hemispheres (I assume lh/rh means left/right hemisphere)? Bug? Problem with the data?
I'm surprised mri_ca_register wasn't significantly accelerated as well - on a Fermi card such as yours, it can be a lot faster too (I was getting 20 minutes cf 2 hours on the CPU). I can't remember if there's an extra switch you need to enable that... Nick?
The GPU load was at 20% while CUDA binaries were executing. What could be the reason for this, i.e. why not 100%? Is this even expected on the GeForce GTX 460? What are the limitations when running multiple GPU-accelerated instances of `recon-all`? How much GPU memory does a single instance take at most?
I don't know how that measurement is made, but there are certainly some possible reasons:
1) The GPU code is certainly far from optimal. Given the extremely painful CPU<->GPU transfers, there was little point in spending lots of time optimising individual routines. And getting very high percentages of theoretical performance is painful on both the CPU and GPU (for example, unless you're writing SSE, then you're probably not going to get better than 25% (single) or 50% (double) of CPU peak). A colleague of mine got to about 70-80% of theoretical max on a GPU on a perfectly parallelisable computation, and that was big news. I doubt that any of the computations performed by the Freesurfer kernels are anywhere close to being so ideally suited to parallel execution (before you worry about my sub-optimal implementations).
2) If it's an average over the entire program, then there is still a significant amount of processing which is done on the CPU (plus all the transfers).
I never tried running multiple recons at once, to timeslice the GPU. Provided your card has enough memory and it's not set to Compute-Exclusive mode, it should be possible.
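If you want to check memory headroom and the compute mode before trying that, something along these lines might do. It is only a sketch: it assumes a driver recent enough for `nvidia-smi --query-gpu` with CSV output, the helper names are made up, and the memory figure a single `recon-all` instance needs is a placeholder I have not measured. Some GeForce cards report utilisation as not supported, which this simple parser does not handle.

```python
import subprocess

QUERY_FIELDS = ["utilization.gpu", "memory.used", "memory.total", "compute_mode"]


def parse_gpu_status(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        util, used, total, mode = [field.strip() for field in line.split(",")]
        gpus.append({
            "utilization_pct": int(util),
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
            "compute_mode": mode,
        })
    return gpus


def query_gpu_status():
    """Run nvidia-smi (must be on PATH) and return the parsed status."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(QUERY_FIELDS),
           "--format=csv,noheader,nounits"]
    return parse_gpu_status(subprocess.check_output(cmd, text=True))


def can_share(gpu, needed_mib):
    """True if the card is in Default compute mode and needed_mib is still free."""
    free = gpu["memory_total_mib"] - gpu["memory_used_mib"]
    return gpu["compute_mode"] == "Default" and free >= needed_mib


# Made-up sample line for illustration, not a measurement from this thread.
sample = "20, 310, 1023, Default\n"
status = parse_gpu_status(sample)
print(status[0], can_share(status[0], 500))
```

On a machine with a card installed, `query_gpu_status()` can replace the sample string; `needed_mib` would have to come from watching a single run first.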
What could be the reasons that `cudadetect` failed to detect any CUDA devices, yet CUDA binaries worked as expected?
I'll defer to Nick on that (as well as any Fermi-enabling options).
Regards,
Richard