Hello,
I'm emailing for assistance with 6 hard-failed FS subjects (out of 1000) that were re-processed with FS 7.1 and still terminated prematurely, each at a different pre-processing stage. The job was run on a SLURM computing cluster affiliated with Boston Children's Hospital's E2 server. I've attached the recon-all.log files for clarity; none of them reports an "exit with errors", they simply stop.
Do you have any idea what the issue(s) might be? Hopefully this is enough information to work with, as I'm not the one who actually ran the commands and am a new RA trying to get acclimated to the computational environment at BCH. Thanks for the help.
Best,
Peter
Not sure. One thing that is suspicious is that they all died within minutes of each other. That, and because there is no actual error message, makes me think that something went wrong on the computer around 7:20AM on June 22. How many CPUs do you have on caladan? It only has 250G of memory, so I'm guessing 8. The load at the time of crash was around 5-7, so you might be near the edge. If you re-run them, do they die at the same place?
On 6/30/2020 4:59 PM, McManus, Peter wrote:
Freesurfer mailing list Freesurfer@nmr.mgh.harvard.edu https://mail.nmr.mgh.harvard.edu/mailman/listinfo/freesurfer
Hi Doug,
Thanks for the response, and apologies for the late reply. Yes, Caladan has 8 CPUs. Re-running the subjects in isolation fixed the problem, and they all finished successfully. Thanks for the help!
Peter
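For anyone hitting the same symptom, the check suggested earlier in this thread (did the logs all stop at about the same time, and without an error line?) can be sketched in Python. This is only a rough sketch, not a FreeSurfer tool: the function names are made up for illustration, and the two marker strings are assumptions about how recon-all ends its log, so please verify them against a log from your own FreeSurfer version.

```python
from pathlib import Path

# Assumed end-of-log markers (check these against your own recon-all.log).
FINISHED = "finished without error"
ERRORED = "exited with ERRORS"


def classify_log(text: str) -> str:
    """Return 'finished', 'errored', or 'truncated' for one log's contents."""
    tail = text[-2000:]  # the final status line sits at the end of the log
    if FINISHED in tail:
        return "finished"
    if ERRORED in tail:
        return "errored"
    return "truncated"  # no final status line: the job just stopped


def summarize(log_paths):
    """Classify each log; report the spread (seconds) of truncated stop times."""
    rows = []
    for p in map(Path, log_paths):
        status = classify_log(p.read_text(errors="replace"))
        # Last-modified time approximates when the job last wrote to the log.
        rows.append((str(p), status, p.stat().st_mtime))
    stops = [mtime for _, status, mtime in rows if status == "truncated"]
    spread = max(stops) - min(stops) if stops else 0.0
    return rows, spread
```

Assuming the usual layout where each subject keeps its log at `<subject>/scripts/recon-all.log`, it could be called as `summarize(Path(subjects_dir).glob("*/scripts/recon-all.log"))`, where `subjects_dir` is your own subjects directory. If several unrelated subjects come back as truncated with a spread of only a few minutes, that points toward something happening on the machine (for example load or memory pressure) rather than a problem with the individual subjects.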
From: freesurfer-bounces@nmr.mgh.harvard.edu on behalf of Douglas N. Greve (dgreve@mgh.harvard.edu)
Sent: Tuesday, June 30, 2020 6:39 PM
To: freesurfer@nmr.mgh.harvard.edu
Subject: Re: [Freesurfer] Fw: Hard Error Troubleshooting [EXTERNAL]