I am playing around with NextBrain on a dataset of individuals with large white matter lesions, and I am noticing that in 10% of the cases NextBrain fails to finish due to an out-of-memory (OOM) error, such as this:
I have tried adjusting the max_split_size_mb parameter to work around the problem (PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512), but it still fails for some individuals. Do you have any suggestions on how to deal with this? For context, I am using a cluster with two 32 GB GPUs and an additional 10 CPUs.
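
For anyone trying to reproduce this, here is a minimal sketch of one way to set the allocator option from Python instead of the shell. The helper name `build_alloc_conf` is hypothetical and only for illustration; the only value taken from my setup is `max_split_size_mb:512`. As far as I understand, the variable has to be in place before PyTorch initializes its CUDA allocator, so it is set before any `import torch`.

```python
import os


def build_alloc_conf(options):
    """Join allocator options into the comma-separated 'key:value' form
    used by the PYTORCH_CUDA_ALLOC_CONF environment variable."""
    return ",".join(f"{key}:{value}" for key, value in options.items())


# Hypothetical helper usage; only max_split_size_mb:512 comes from the post.
alloc_conf = build_alloc_conf({"max_split_size_mb": 512})

# Set the variable before PyTorch is imported, so the CUDA caching
# allocator sees it when it is first initialized.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = alloc_conf
```

Exporting the same string in the job script before launching NextBrain should be equivalent; the Python form is just convenient when the value needs to change per subject.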