Hi there,

I am playing around with NextBrain with a dataset of individuals with large white matter lesions, and I am noticing that in 10% of the cases NextBrain fails to finish due to a OOM error, such as this:

RuntimeError: CUDA out of memory. Tried to allocate 1.08 GiB (GPU 0; 31.73 GiB total capacity; 28.54 GiB already allocated; 1.00 GiB free; 30.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have tried to adjust the max_split_size_mb parameter to circumvent the problem (PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512) but it is still failing for some individuals. Do you guys have any inputs about how to deal with this? For context, I am using a cluster with two 32GB GPUs with additional 10 CPUs.

Thanks!

Nárlon