Hi David,
I'm surprised you didn't have to do this in the past. We always space our jobs out. Glad there's an easy workaround.
cheers Bruce
On Fri, 3 Sep 2010, David Mischel wrote:
We took the suggestion of staggering the launch of Freesurfer 5.0 recon-all jobs. The attached Word doc (I don't know how to contribute this information other than attaching the image and text using Word) shows a load graph on our file server. When 20 FS jobs began at once (all processing servers using a single file server) the load on the file server bulged up. When we spaced out the launch of each job by 15 seconds the load hardly budged.
We have not had to do this in the past with earlier versions of Freesurfer, but this is an obvious workaround for the problem we encountered.
< david
David Mischel
Manager of IT
Center for Imaging of Neurodegenerative Diseases (CIND)
http://www.cind.research.va.gov/
VA Medical Center
4150 Clement Street, 114M
San Francisco, CA 94121
voice: 415-221-4810 x3864
fax: 415-668-2864
Hi Bruce and others,
My question is related to the staggering of FS jobs (see the earlier messages in this thread). Could you tell me by how much time you stagger FreeSurfer jobs on a cluster?
I am using FS 5.1 on a cluster that uses Sun Grid Engine, and I get the following error when I submit a large number of jobs. I do not see it for small job batches, so I think it might be related to staggering FS jobs, but I am not sure.
The error message from recon-all.log:
nu_estimate_np_and_em: crashed while running spline_smooth (termination status=11)
nu_correct: crashed while running nu_estimate_np_and_em (termination status=65280)
The message in recon-all.error:
PWD /work/01523/msampat/freesurfer-5.1/subjects/ms0880_01/mri
CMD mri_nu_correct.mni --i orig.mgz --o nu.mgz --uchar transforms/talairach.xfm --proto-iters 1000 --distance 50 --n 1
At first I thought it was a memory issue, but Sun Grid Engine is supposed to assign 4 GB per core, so I thought that was enough memory to run each case. I am investigating whether the memory is being allocated incorrectly.
If anyone knows what this error is related to, could you please let me know?
Thanks,
Mehul
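One way to read the two termination status values in the log is as raw wait statuses, the way Perl's `$?` reports them. That the MNI scripts print the value in this form is an assumption, not something stated in this thread. Under that reading, 11 would mean the process was killed by signal 11 (SIGSEGV on Linux), and 65280 (0xFF00) would be an ordinary exit code of 255 from the calling script. A minimal Python sketch of the decoding, where `decode_wait_status` is a hypothetical helper name:

```python
import signal


def decode_wait_status(status):
    """Split a raw 16-bit wait status into signal or exit-code information."""
    sig = status & 0x7F               # low 7 bits: terminating signal, 0 if none
    core_dumped = bool(status & 0x80)  # bit 7: core dump flag
    exit_code = (status >> 8) & 0xFF   # high byte: exit code on normal exit
    if sig:
        try:
            name = signal.Signals(sig).name
        except ValueError:
            name = "UNKNOWN"  # signal number not defined on this platform
        return {"signal": sig, "name": name, "core_dumped": core_dumped}
    return {"exit_code": exit_code}


if __name__ == "__main__":
    for status in (11, 65280):  # the two values from recon-all.log
        print(status, decode_wait_status(status))
```

If this reading is right, the segmentation fault is in spline_smooth itself, and the 65280 is only the wrapper reporting that its child failed.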
Freesurfer mailing list Freesurfer@nmr.mgh.harvard.edu https://mail.nmr.mgh.harvard.edu/mailman/listinfo/freesurfer
The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Partners Compliance HelpLine at http://www.partners.org/complianceline . If the e-mail was sent to you in error but does not contain patient information, please contact the sender and properly dispose of the e-mail.
Hi Mehul,
We don't know, sorry. We are trying to replicate it here. The fact that it's in an MNI tool and not one of ours makes it harder to track down.
The delay between jobs is driven by how robust and fast your storage is. A minute or two should certainly be enough, I would think, although we typically launch much more rapidly than that.
Bruce
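The staggering discussed in this thread (15 seconds between launches in David's test, "a minute or two" as a safe upper bound here) could be sketched roughly as follows in Python. This is only an illustration under assumptions: `run_recon.sh` is a hypothetical wrapper script around recon-all, the way the subject is passed to it is made up, and the right delay depends on your storage.

```python
import subprocess
import time


def stagger_launch(subjects, launch, delay_s=15.0, sleep=time.sleep):
    """Call launch(subject) for each subject, pausing delay_s seconds between launches."""
    launched = []
    for index, subject in enumerate(subjects):
        if index > 0:
            # Let the file server absorb the previous job's startup I/O first.
            sleep(delay_s)
        launch(subject)
        launched.append(subject)
    return launched


def qsub_recon_all(subject, job_script="run_recon.sh"):
    """Submit one subject with SGE's qsub; job_script is a hypothetical recon-all wrapper."""
    subprocess.run(["qsub", job_script, subject], check=True)
```

For example, `stagger_launch(["subj01", "subj02", "subj03"], qsub_recon_all, delay_s=15)` would submit three jobs 15 seconds apart. Note that this only spaces out submission; if the scheduler holds the jobs in a queue and releases them together, the delay would have to go inside the job script instead.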
Thanks for letting me know.
On my side, my first guess is that, for some reason, memory is not being allocated correctly by the script I use (it mostly uses standard Sun Grid Engine flags). I am trying to verify this with the sysadmins.
Just FYI: I am using the cluster on www.teragrid.org, and one of the sysadmins there looked at the error and thinks the "MCSRCH routine" is the cause. I don't know what that is, but he suggested the following action:
"I found this reply from the developer of the routine that fails (MCSRCH) regarding this very same error: http://code.google.com/p/mitlm/issues/detail?id=10 Can you try that and see if it works?"
I will let you know if we find a solution. Thanks Mehul
Mehul,
To get our discussion back on the list (since other cluster users will inevitably run into this same problem), the solution another user found was to modify the MNI N3 source in the following manner, quoted within bars:
========
.... two modifications to the N3 distribution.
first, I had to add the -nocache argument to the calls to volume_stats within nu_estimate_np_and_em.
second, I had to comment out some cache-related code from splineSmooth.cc and evaluateField.cc:
//set_n_bytes_cache_threshold(0); // Always cache volume
//set_cache_block_sizes_hint(SLICE_ACCESS); // Cache volume slices
//set_default_max_bytes_in_cache(0); // keep only one slice at a time
There is clearly something that is very broken about the way that the N3 code does its caching, at least in the parallel context. Unfortunately, turning off caching has really slowed performance - the fastest autorecon1 job (on a single mprage) completed in about 90 minutes, and about 1/4 of the ~750 jobs are still running 4 hours later. But at least they run! As we move forward we will also be testing the subsequent autorecon steps, and I will let you know if we run into any additional things that break on the cluster.
=============
A hacky alternative might be to modify recon-all to implement a simple file-locking scheme in which only one instance of recon-all at a time is allowed to run the nu_correct stage, based on checking for a lock file that all instances of recon-all can access. This might be easy to implement, and it would still benefit from the existing caching (since nu_correct takes about 2-3 minutes to run per instance, it wouldn't really hold things up much).
n.
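The file-locking idea above could look something like the following Python sketch. It is not how recon-all works today; `exclusive_lock` and the lock path are hypothetical, and it assumes the lock file lives on a filesystem visible to every node. Whether exclusive file creation is truly atomic on a given network filesystem depends on the setup, so that is an assumption to verify before relying on it.

```python
import contextlib
import os
import time


@contextlib.contextmanager
def exclusive_lock(lock_path, poll_s=5.0, timeout_s=None):
    """Hold lock_path while the with-block runs; wait while another process holds it."""
    start = time.monotonic()
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only one
            # process can create the lock file at a time.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if timeout_s is not None and time.monotonic() - start >= timeout_s:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll_s)
    try:
        with os.fdopen(fd, "w") as lock_file:
            lock_file.write(f"{os.getpid()}\n")  # record the holder for debugging
        yield
    finally:
        os.unlink(lock_path)  # release the lock even if the guarded step failed
```

A wrapper could then run the nu_correct step inside `with exclusive_lock("/shared/path/nu_correct.lock"):` (the path is made up). One design caveat: if a job is killed while holding the lock, the file stays behind and blocks everyone else, so a real version would need some stale-lock cleanup.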