NextBrain - CUDA out of memory error - Freesurfer

7 Nov 2024


Hi there,
I am playing around with NextBrain on a dataset of individuals with large white matter lesions, and I am noticing that in 10% of the cases NextBrain fails to finish due to an out-of-memory (OOM) error, such as this:
RuntimeError: CUDA out of memory. Tried to allocate 1.08 GiB (GPU 0; 31.73 GiB total capacity; 28.54 GiB already allocated; 1.00 GiB free; 30.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I have tried adjusting the max_split_size_mb parameter to work around the problem (PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512), but it still fails for some individuals. Do you guys have any input on how to deal with this? For context, I am using a cluster with two 32 GB GPUs and 10 CPUs.
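In case it helps with diagnosing, here is a minimal sketch that pulls the numbers out of the error message above and computes how much memory is reserved by PyTorch but not allocated. The helper name parse_oom and the key names are my own and hypothetical, not part of NextBrain or PyTorch, and the sketch assumes every figure in the message is reported in GiB.

```python
import re

# The error message quoted above (without the trailing hint text).
OOM_MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 1.08 GiB "
    "(GPU 0; 31.73 GiB total capacity; 28.54 GiB already allocated; "
    "1.00 GiB free; 30.30 GiB reserved in total by PyTorch)"
)


def parse_oom(message):
    """Extract the GiB figures from a PyTorch CUDA OOM message.

    Hypothetical helper: assumes all figures are given in GiB.
    """
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    stats = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in message")
        stats[name] = float(match.group(1))
    # Memory held by the caching allocator but not backing any live tensor.
    stats["reserved_unallocated"] = round(
        stats["reserved"] - stats["allocated"], 2
    )
    return stats


stats = parse_oom(OOM_MESSAGE)
print(stats)
```

For the message above, the reserved-but-unallocated figure works out to about 1.76 GiB (30.30 minus 28.54), so reserved memory is not much larger than allocated memory. That might be why max_split_size_mb is not helping here, though I am not sure.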
Thanks!
Nárlon