Re: [Freesurfer] Experiences with GPU-assisted recon-all

3 Mar 2012

      On Sat, Mar 3, 2012 at 6:57 PM, Abdulkadir Ahmed
ahmed.abdulkadir@epfl.ch wrote:
...
Freesurfer has supported GPU acceleration since version 5.0. To assess the utility of this functionality, particularly the gain in performance, I ran `recon-all` with and without the option `-use-gpu`. I'd like to share the results and also hope to get answers to the questions that came up. The video card used was a GeForce GTX 460 and the CPU was an Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz.
...

Running mri_em_register_cuda was about 4 times faster than mri_em_register. During execution of mri_em_register_cuda, the GPU load was at 20%, as indicated by `nvidia-smi -a`.

If you look at the source, there are options in the mri_em_register
source files which can use a much faster GPU algorithm. Unfortunately,
this does not always converge to the same result (although I have no
reason to think it's a 'worse' answer).
...

Running `recon-all` with the `-use-gpu` option took 6:49 hours; without the option it took about 20 minutes longer. A summary of the processing steps taken from `recon-all-status.log` is attached to this email. Note that these times represent wall clock times, not CPU times.

While Amdahl's Law is painful, that's a surprisingly small advantage :-(
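To put numbers on that, here is a minimal Python sketch that applies Amdahl's Law to the two wall clock times quoted above (6:49 hours with the GPU, about 20 minutes more without). The function names are made up for illustration, and the bound assumes no step got slower on the GPU, which is not strictly true for this run.

```python
def amdahl(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the runtime is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)


def min_accelerated_fraction(overall: float) -> float:
    """Smallest fraction p consistent with an observed overall speedup.

    As s grows without bound, amdahl(p, s) approaches 1 / (1 - p),
    so an observed speedup S requires p >= 1 - 1 / S.
    """
    return 1.0 - 1.0 / overall


# Wall clock times from the report, in minutes.
minutes_with_gpu = 6 * 60 + 49               # 6:49 hours
minutes_without_gpu = minutes_with_gpu + 20  # "about 20 minutes longer"

observed = minutes_without_gpu / minutes_with_gpu
p_min = min_accelerated_fraction(observed)

print(f"overall speedup: {observed:.3f}")
print(f"accelerated share of the CPU-only run: at least {p_min:.1%}")
```

The overall speedup comes out at roughly 1.05, so under these assumptions the accelerated steps need only have accounted for about 5% of the CPU-only run.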
...
Hopefully this information will be helpful. The processes "EM Registration" and "CA Normalize" were significantly accelerated by the GPU, as expected. Also "SubCort Seg" ran much faster. "Surf Reg rh" ran 3.4 times faster with GPU acceleration, while the accelerated version of "Surf Reg lh" was slower than the regular version. What binary was used in these steps? What could be the reason for the difference between hemispheres (I assume lh/rh means left/right hemisphere)? Bug? Problem with the data?

I'm surprised mri_ca_register wasn't significantly accelerated as well
- on a Fermi card such as yours, it can be a lot faster too (I was
getting 20 minutes, cf. 2 hours on the CPU). I can't remember if there's
an extra switch you need to enable that... Nick?
...
The GPU load was at 20% while CUDA binaries were executed. What could be the reason for this, i.e. why not 100%? Is this even expected on the GeForce GTX 460? What are the limitations when running multiple GPU-accelerated instances of `recon-all`? How much GPU memory does a single instance take at most?

I don't know how that measurement is made, but there are certainly
some possible reasons:
1) The GPU code is certainly far from optimal. Given the extremely
painful CPU<->GPU transfers, there was little point in spending lots
of time optimising individual routines. And getting very high
percentages of theoretical performance is painful on both the CPU and
GPU (for example, unless you're writing SSE, then you're probably not
going to get better than 25% (single) or 50% (double) of CPU peak). A
colleague of mine got to about 70-80% of theoretical max on a GPU on a
perfectly parallelisable computation, and that was big news. I doubt
that any of the computations performed by the Freesurfer kernels are
anywhere close to being so ideally suited to parallel execution
(before you worry about my sub-optimal implementations)
2) If it's an average over the entire program, then there is still a
significant amount of processing which is done on the CPU (plus all
the transfers).
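Point 2 can be illustrated with a small Python sketch that averages GPU utilisation samples taken while a CUDA binary runs. It assumes an `nvidia-smi` new enough to support the `--query-gpu` CSV interface (the report above used `nvidia-smi -a`, so this may not apply to drivers of that era), and that the card reports numeric values; some cards print a "not supported" placeholder instead, which this sketch does not handle. The sample values in the demo are made up.

```python
import subprocess

# Assumed query interface of recent nvidia-smi versions.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_sample(line: str) -> tuple:
    """Turn a CSV line such as '95, 610' into (utilisation %, memory used in MiB)."""
    util, mem = (field.strip() for field in line.split(","))
    return int(util), int(mem)


def read_sample() -> tuple:
    """Take one reading from the first GPU (requires nvidia-smi on the PATH)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_sample(out.splitlines()[0])


def summarise(samples: list) -> dict:
    """Average and peak utilisation, plus peak memory, over a list of samples."""
    utils = [u for u, _ in samples]
    return {
        "mean_util": sum(utils) / len(utils),
        "peak_util": max(utils),
        "peak_mem_mib": max(m for _, m in samples),
    }


if __name__ == "__main__":
    # Made-up samples: one short burst of GPU work between CPU-only phases.
    fake = ["0, 120", "95, 610", "0, 610", "5, 610", "0, 120"]
    print(summarise([parse_sample(s) for s in fake]))
```

In the made-up data the GPU is nearly saturated in one sample yet the mean is only 20%, which is the kind of averaging effect described above. On a real system, `read_sample()` could be called in a loop with a short sleep while the CUDA binary runs.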
I never tried running multiple recons at once, to timeslice the GPU.
Provided your card has enough memory and it's not set to
Compute-Exclusive mode, it should be possible.
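As a rough Python sketch of that condition: the check below says whether a given number of instances could share one card, given its compute mode and memory. All numbers are placeholders, since the per-instance memory footprint of `recon-all` is not known from this thread, and the compute mode string is assumed to be whatever `nvidia-smi` reports for the card ("Default" when sharing is allowed).

```python
def can_share_gpu(compute_mode: str, total_mib: int, used_mib: int,
                  per_instance_mib: int, instances: int) -> bool:
    """Return True if `instances` processes could plausibly share one GPU.

    Sharing is ruled out unless the card is in the default compute mode,
    and the combined memory demand must fit in what the card has.
    """
    if compute_mode.strip().lower() != "default":
        return False
    return used_mib + instances * per_instance_mib <= total_mib


# Hypothetical figures: a 1 GiB card with 100 MiB already in use and an
# assumed (not measured) 400 MiB per recon-all instance.
print(can_share_gpu("Default", 1024, 100, 400, 2))
print(can_share_gpu("Default", 1024, 100, 400, 3))
print(can_share_gpu("Exclusive_Process", 1024, 100, 400, 1))
```

The real per-instance figure would have to be measured, for example by watching the memory usage reported by `nvidia-smi` during a single run.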
...
What could be the reasons that `cudadetect` failed to detect any CUDA devices, yet CUDA binaries worked as expected?

I'll defer to Nick on that (as well as any Fermi-enabling options).

Regards,
Richard
