Dear SPM community,
Freesurfer has supported GPU acceleration since version 5.0. To assess the utility of this functionality, particularly the gain in performance, I ran `recon-all` with and without the option `-use-gpu`. I'd like to share the results and also hope to get answers to some questions that came up. The video card used was a GeForce GTX 460 and the CPU was an Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz.
1) `cudadetect` did not recognize the video card:
    Detecting CUDA... *** No CUDA enabled device(s) detected! ***
2) `mri_em_register_cuda`, called without parameters, gave the following output:
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2010 NVIDIA Corporation
    Built on Thu_Nov__4_12:44:17_PDT_2010
    Cuda compilation tools, release 3.2, V0.2.1221
    Driver : 3.20 Runtime : 3.20
    Acquiring CUDA device
    Using default device
    CUDA device: GeForce GTX 460
    ...
3) Running mri_em_register_cuda was about 4 times faster than mri_em_register. During execution of mri_em_register_cuda, the GPU load was at 20% as indicated by `nvidia-smi -a`
4) Running `recon-all` with the `-use-gpu` option took 6:49 hours; without the option it took about 20 minutes longer. A summary of the processing steps taken from `recon-all-status.log` is attached to this email. Note that these are wall-clock times, not CPU times.
Hopefully this information will be helpful. The processes "EM Registration" and "CA Normalize" were significantly accelerated by the GPU, as expected. Also "SubCort Seg" ran much faster. "Surf Reg rh" ran 3.4 times faster with GPU acceleration while the accelerated version of "Surf Reg lh" was slower than the regular. What binary was used in these steps? What could be the reason for the difference between hemispheres (I assume lh/rh means left/right hemisphere)? Bug? Problem with the data?
According to `-recon-all`, the following binaries support GPU acceleration: mri_em_register, mri_ca_register, mris_inflate and mris_sphere. Could someone point out which binaries correspond to the 64 steps reported in "scripts/recon-all-status.log"? This would make it possible to identify the steps that have GPU-accelerated variants.
The GPU performance was at 20% while CUDA binaries were executed. What could be the reason for this, i.e. why not 100%? Is this even expected on the GeForce GTX 460? What are the limitations when running multiple GPU-accelerated instances of `recon-all`? How much GPU memory does a single instance take at most?
What could be the reasons that `cudadetect` failed to detect any CUDA devices, yet CUDA binaries worked as expected?
Best regards,
Ahmed Abdulkadir
--
Master Student, Life Sciences, Semester 4
Medical Image Processing (MIP) Lab
École Polytechnique Fédérale de Lausanne

Student Assistant
Functional Brain Imaging (FBI)
Department of Neurology
University Medical Center Freiburg
Breisacher Str. 64, D-79106 Freiburg
--
On Sat, Mar 3, 2012 at 6:57 PM, Abdulkadir Ahmed ahmed.abdulkadir@epfl.ch wrote:
- Running mri_em_register_cuda was about 4 times faster than mri_em_register. During execution of mri_em_register_cuda, the GPU load was at 20% as indicated by `nvidia-smi -a`
If you look at the source, there are options in the mri_em_register source files which can use a much faster GPU algorithm. Unfortunately, this does not always converge to the same result (although I have no reason to think it's a 'worse' answer).
- Running `recon-all` with the `-use-gpu` option took 6:49 hours; without the option it took about 20 minutes longer. A summary of the processing steps taken from `recon-all-status.log` is attached to this email. Note that these are wall-clock times, not CPU times.
While Amdahl's Law is painful, that's a surprisingly small advantage :-(
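To put rough numbers on the Amdahl's Law remark, here is a minimal sketch. It assumes the totals quoted above (6:49 hours with `-use-gpu`, about 20 minutes longer without) are accurate; the function names are purely illustrative and not part of Freesurfer.

```python
def overall_speedup(t_cpu_min, t_gpu_min):
    """Ratio of the CPU-only wall-clock time to the GPU-assisted one."""
    return t_cpu_min / t_gpu_min


def amdahl_speedup(p, s):
    """Amdahl's Law: overall speedup when a fraction p of the original
    runtime is accelerated by a factor s and the rest is unchanged."""
    return 1.0 / ((1.0 - p) + p / s)


def min_accelerated_fraction(speedup):
    """Smallest fraction p of the original runtime that must have been
    accelerated to reach `speedup` (the limit s -> infinity)."""
    return 1.0 - 1.0 / speedup


# Totals as reported in the thread (assumed accurate, in minutes).
t_gpu = 6 * 60 + 49   # 6:49 hours with -use-gpu
t_cpu = t_gpu + 20    # "about 20 minutes longer" without

s_total = overall_speedup(t_cpu, t_gpu)
p_min = min_accelerated_fraction(s_total)

print(f"overall speedup: {s_total:.3f}")
print(f"accelerated share of the CPU-only runtime: at least {p_min:.1%}")
```

Under these assumptions the whole run is only about 5% faster, and the time saved corresponds to roughly 20 of 429 minutes. If the per-step timings are wrong, these figures are wrong as well.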
Hopefully this information will be helpful. The processes "EM Registration" and "CA Normalize" were significantly accelerated by the GPU, as expected. Also "SubCort Seg" ran much faster. "Surf Reg rh" ran 3.4 times faster with GPU acceleration while the accelerated version of "Surf Reg lh" was slower than the regular. What binary was used in these steps? What could be the reason for the difference between hemispheres (I assume lh/rh means left/right hemisphere)? Bug? Problem with the data?
I'm surprised mri_ca_register wasn't significantly accelerated as well - on a Fermi card such as yours, it can be a lot faster too (I was getting 20 minutes cf 2 hours on the CPU). I can't remember if there's an extra switch you need to enable that... Nick?
The GPU performance was at 20% while CUDA binaries were executed. What could be the reason for this, i.e. why not 100%? Is this even expected on the GeForce GTX 460? What are the limitations when running multiple GPU-accelerated instances of `recon-all`? How much GPU memory does a single instance take at most?
I don't know how that measurement is made, but there are certainly some possible reasons:
1) The GPU code is certainly far from optimal. Given the extremely painful CPU<->GPU transfers, there was little point in spending lots of time optimising individual routines. And getting very high percentages of theoretical performance is painful on both the CPU and GPU (for example, unless you're writing SSE, then you're probably not going to get better than 25% (single) or 50% (double) of CPU peak). A colleague of mine got to about 70-80% of theoretical max on a GPU on a perfectly parallelisable computation, and that was big news. I doubt that any of the computations performed by the Freesurfer kernels are anywhere close to being so ideally suited to parallel execution (before you worry about my sub-optimal implementations).
2) If it's an average over the entire program, then there is still a significant amount of processing which is done on the CPU (plus all the transfers).
I never tried running multiple recons at once, to timeslice the GPU. Provided your card has enough memory and it's not set to Compute-Exclusive mode, it should be possible.
What could be the reasons that `cudadetect` failed to detect any CUDA devices, yet CUDA binaries worked as expected?
I'll defer to Nick on that (as well as any Fermi-enabling options).
Regards,
Richard
my comments inserted below...
On Sat, 2012-03-03 at 20:08 -0500, R Edgar wrote:
On Sat, Mar 3, 2012 at 6:57 PM, Abdulkadir Ahmed ahmed.abdulkadir@epfl.ch wrote:
- Running mri_em_register_cuda was about 4 times faster than mri_em_register. During execution of mri_em_register_cuda, the GPU load was at 20% as indicated by `nvidia-smi -a`
If you look at the source, there are options in the mri_em_register source files which can use a much faster GPU algorithm. Unfortunately, this does not always converge to the same result (although I have no reason to think it's a 'worse' answer).
- Running `recon-all` with the `-use-gpu` option took 6:49 hours; without the option it took about 20 minutes longer. A summary of the processing steps taken from `recon-all-status.log` is attached to this email. Note that these are wall-clock times, not CPU times.
While Amdahl's Law is painful, that's a surprisingly small advantage :-(
this time difference doesn't seem right. in looking at your attached file recon-all-status-comparison.csv, i see these two lines:
    #@# CA Reg ,00:01:15,00:01:16,1
    #@# SubCort Seg ,00:03:23,00:09:36,2.8
the times listed do not look correct. the CA Reg step, which runs mri_ca_register, usually takes hours, whereas SubCort Seg, which runs mri_ca_label, takes minutes. here the times differ from that by an order of magnitude. how were these times extracted? one option is to use the -time flag with recon-all, which writes timestamps to the log; these can be extracted with `grep time::` to show times like this:
    time:: 13:09:55 elapsed:: 47395 cmd:: mri_ca_register
    time:: 00:02:13 elapsed:: 00133 cmd:: mri_ca_register
    time:: 00:02:26 elapsed:: 00146 cmd:: mri_remove_neck
    time:: 00:39:09 elapsed:: 02349 cmd:: mri_em_register
    time:: 00:43:55 elapsed:: 02635 cmd:: mri_ca_label
where you can use the elapsed value (in seconds) to get an accurate measure.
btw, the binaries that use cuda have _cuda in their name, so you can grep the log to see which ones are using cuda.
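As an illustration of the two suggestions above, here is a minimal sketch that sums the elapsed seconds per command and flags commands with `_cuda` in their name. It assumes the log lines have exactly the `time:: ... elapsed:: ... cmd:: ...` layout of the sample in this message; the pattern and the function names are assumptions made for the example, and the real log format may differ.

```python
import re
from collections import defaultdict

# Pattern inferred from the sample lines quoted in this thread (assumption).
LINE_RE = re.compile(
    r"time::\s+\d+:\d{2}:\d{2}\s+elapsed::\s+(\d+)\s+cmd::\s+(\S+)"
)


def parse_timing_lines(lines):
    """Return a list of (command, elapsed_seconds, uses_cuda) tuples."""
    records = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a timing line
        elapsed, cmd = match.groups()
        records.append((cmd, int(elapsed), "_cuda" in cmd))
    return records


def total_elapsed_by_command(records):
    """Sum elapsed seconds for each command name."""
    totals = defaultdict(int)
    for cmd, elapsed, _uses_cuda in records:
        totals[cmd] += elapsed
    return dict(totals)


# Sample lines copied from the message above.
SAMPLE = [
    "time:: 13:09:55 elapsed:: 47395 cmd:: mri_ca_register",
    "time:: 00:02:13 elapsed:: 00133 cmd:: mri_ca_register",
    "time:: 00:02:26 elapsed:: 00146 cmd:: mri_remove_neck",
    "time:: 00:39:09 elapsed:: 02349 cmd:: mri_em_register",
    "time:: 00:43:55 elapsed:: 02635 cmd:: mri_ca_label",
]

records = parse_timing_lines(SAMPLE)
for cmd, seconds in sorted(total_elapsed_by_command(records).items()):
    print(f"{cmd}: {seconds} s")
```

Running the same functions on the timing lines from a CPU run and a `-use-gpu` run would give a per-command comparison that does not depend on how the status log was summarised.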
Hopefully this information will be helpful. The processes "EM Registration" and "CA Normalize" were significantly accelerated by the GPU, as expected. Also "SubCort Seg" ran much faster. "Surf Reg rh" ran 3.4 times faster with GPU acceleration while the accelerated version of "Surf Reg lh" was slower than the regular. What binary was used in these steps? What could be the reason for the difference between hemispheres (I assume lh/rh means left/right hemisphere)? Bug? Problem with the data?
I'm surprised mri_ca_register wasn't significantly accelerated as well - on a Fermi card such as yours, it can be a lot faster too (I was getting 20 minutes cf 2 hours on the CPU). I can't remember if there's an extra switch you need to enable that... Nick?
the fermi version of the mri_ca_register is not yet publicly available.
The GPU performance was at 20% while CUDA binaries were executed. What could be the reason for this, i.e. why not 100%? Is this even expected on the GeForce GTX 460? What are the limitations when running multiple GPU-accelerated instances of `recon-all`? How much GPU memory does a single instance take at most?
I don't know how that measurement is made, but there are certainly some possible reasons:
1) The GPU code is certainly far from optimal. Given the extremely painful CPU<->GPU transfers, there was little point in spending lots of time optimising individual routines. And getting very high percentages of theoretical performance is painful on both the CPU and GPU (for example, unless you're writing SSE, then you're probably not going to get better than 25% (single) or 50% (double) of CPU peak). A colleague of mine got to about 70-80% of theoretical max on a GPU on a perfectly parallelisable computation, and that was big news. I doubt that any of the computations performed by the Freesurfer kernels are anywhere close to being so ideally suited to parallel execution (before you worry about my sub-optimal implementations).
2) If it's an average over the entire program, then there is still a significant amount of processing which is done on the CPU (plus all the transfers).
I never tried running multiple recons at once, to timeslice the GPU. Provided your card has enough memory and it's not set to Compute-Exclusive mode, it should be possible.
What could be the reasons that `cudadetect` failed to detect any CUDA devices, yet CUDA binaries worked as expected?
not sure. cudadetect is a script which first calls `source $FREESURFER_HOME/bin/cuda_setup` to point LD_LIBRARY_PATH at cuda, so maybe it's somehow failing at that step.
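One way to test this guess is to check whether the CUDA runtime library can be found through `LD_LIBRARY_PATH` in the shell where `cudadetect` is run. The following is a minimal sketch. It assumes the runtime library file name starts with `libcudart.so` (the usual name on Linux); the helper name is hypothetical, and this is not what `cudadetect` or `cuda_setup` do internally.

```python
import os


def dirs_with_cuda_runtime(ld_library_path, prefix="libcudart.so"):
    """Return the directories listed in an LD_LIBRARY_PATH-style string
    that contain a file whose name starts with `prefix`."""
    hits = []
    for directory in ld_library_path.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue  # skip empty entries and paths that do not exist
        if any(name.startswith(prefix) for name in os.listdir(directory)):
            hits.append(directory)
    return hits


found = dirs_with_cuda_runtime(os.environ.get("LD_LIBRARY_PATH", ""))
if found:
    print("CUDA runtime found in:", ", ".join(found))
else:
    print("no libcudart.so* found via LD_LIBRARY_PATH")
```

An empty result does not by itself mean the library cannot be loaded, because the dynamic loader also searches other locations such as the system linker cache. That could be one reason a CUDA binary runs while a script relying on the environment variable reports nothing, but this is only a possibility.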
I'll defer to Nick on that (as well as any Fermi-enabling options).
Regards,
Richard
Freesurfer mailing list Freesurfer@nmr.mgh.harvard.edu https://mail.nmr.mgh.harvard.edu/mailman/listinfo/freesurfer