We run things on an Xgrid cluster that is now a mixture of Macs running Snow Leopard (10.6.8) and Lion (10.7.x). We found the results given from asegstats2table and aparcstats2table are identical after running 100 subjects on both Mac OS X 10.7.4 and 10.6.8 with Freesurfer 5.1. I also ran a few subjects on the cluster and on individual computers (not on the cluster so to speak) and the results are identical.
Best, Peter
Hi,
The paper entitled
“The Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating System Version on Anatomical Volume and Cortical Thickness Measurements”, PLoSONE, Vol 7(6), e38234 (2012)
may be of interest to all of you. It can be found at:
http://dx.plos.org/10.1371/journal.pone.0038234
Cheers, Ed
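A check like the one Peter describes (identical tables from two OS versions) can be scripted rather than eyeballed. The following is only a minimal sketch: it assumes the tables are tab-separated text with subjects in the first column and one measure per remaining column, and the subject IDs and volumes are made up for illustration.

```python
import csv
import io

def read_table(text):
    """Parse a tab-separated stats table into {subject: {measure: value}}."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    table = {}
    for row in reader:
        if not row:
            continue
        # First column is the subject ID; the rest are numeric measures.
        table[row[0]] = dict(zip(header[1:], (float(v) for v in row[1:])))
    return table

def diff_tables(a, b):
    """Return (subject, measure, value_a, value_b) for every cell that differs."""
    diffs = []
    for subject in sorted(set(a) | set(b)):
        row_a, row_b = a.get(subject, {}), b.get(subject, {})
        for measure in sorted(set(row_a) | set(row_b)):
            va, vb = row_a.get(measure), row_b.get(measure)
            if va != vb:  # exact comparison: "identical" means bit-for-bit equal values
                diffs.append((subject, measure, va, vb))
    return diffs

# Hypothetical tables standing in for output produced on two systems.
snow_leopard = (
    "Measure:volume\tLeft-Hippocampus\tRight-Hippocampus\n"
    "subj01\t4021.5\t4110.2\n"
    "subj02\t3890.0\t3975.8\n"
)
lion = snow_leopard
changed = snow_leopard.replace("3975.8", "3975.9")

print(diff_tables(read_table(snow_leopard), read_table(lion)))     # no differences
print(diff_tables(read_table(snow_leopard), read_table(changed)))  # one differing cell
```

An exact comparison is deliberate here: the claim being tested is that the results are identical, not merely close.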
Hi Peter
thanks for the info. Feel free to post this on one of the *many* blogs howling for our blood :) I think that the effects in the paper reflect default floating point settings in gcc on 32-bit vs. 64-bit, although we haven't really investigated as it doesn't seem like a wise use of limited resources (since no one would ever do a study that way in any case).
Bruce
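To illustrate why floating point settings alone could plausibly change an output, here is a small sketch showing that the same arithmetic gives different answers when intermediate values are kept at a different precision or summed in a different order. It is not a diagnosis of the effect in the paper; the helper names are made up for the example.

```python
import struct

def to_single(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(step, n, round_fn=lambda v: v):
    """Add `step` to a running total n times, rounding each intermediate with round_fn."""
    total = 0.0
    for _ in range(n):
        total = round_fn(total + step)
    return total

# Same computation, different precision for the intermediate results.
double_sum = accumulate(0.1, 1000)
single_sum = accumulate(to_single(0.1), 1000, to_single)
print(double_sum, single_sum, double_sum == single_sum)

# Even at one precision, the order of operations matters.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```

In a long iterative pipeline, small discrepancies of this kind can be carried forward from step to step, which is one way that different compiler defaults or math libraries might end up producing measurably different volumes.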
On Wed, 20 Jun 2012, Peter J. Molfese wrote:
We run things on an Xgrid cluster that is now a mixture of Macs running Snow Leopard (10.6.8) and Lion (10.7.x). We found the results given from asegstats2table and aparcstats2table are identical after running 100 subjects on both Mac OS X 10.7.4 and 10.6.8 with Freesurfer 5.1. I also ran a few subjects on the cluster and on individual computers (not on the cluster so to speak) and the results are identical.
Best, Peter
I'm curious: For comparison to the results in that paper, has anyone quantified the variability that results when one runs the same FS version repeatedly on the same subject, but with a different random seed each time? That is, how much of the difference is related to math libraries vs. intrinsic variability that arises from the components of FS that use a random seed?
cheers, -MH
_______________________________________________ Freesurfer mailing list Freesurfer@nmr.mgh.harvard.edu https://mail.nmr.mgh.harvard.edu/mailman/listinfo/freesurfer The information in this e-mail is intended only for the person to whom it is addressed. If you believe this e-mail was sent to you in error and the e-mail contains patient information, please contact the Partners Compliance HelpLine at http://www.partners.org/complianceline . If the e-mail was sent to you in error but does not contain patient information, please contact the sender and properly dispose of the e-mail.
Nick might have, not sure. Are you volunteering Mike :)
Mike,
I have run such tests in the past (which showed about 2-3% variability in hippo volume due to randomness) and Doug has done similar tests, but admittedly running a large scale analysis and showing the results on a wiki page would be useful. Something like running our Buckner40 and/or ADNI60 data set, making 20 copies of each subject and including the -randomness flag, and plotting the mean/std of the volume and surface rois.
Nick
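The summary Nick proposes (mean and standard deviation of each ROI across repeated runs of the same subject) could be computed along these lines. This is a sketch only: the ROI names and volumes below are invented, and collecting the per-run values from the actual stats files is left out.

```python
from statistics import mean, stdev

def variability(runs):
    """Summarise repeated runs.

    runs: {roi_name: [value from each repeated run of one subject]}
    returns: {roi_name: (mean, sample_std, coefficient_of_variation_percent)}
    """
    summary = {}
    for roi, values in runs.items():
        m = mean(values)
        s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
        summary[roi] = (m, s, 100.0 * s / m)
    return summary

# Hypothetical volumes (mm^3) from five reruns of one subject with different seeds.
runs = {
    "Left-Hippocampus": [4000.0, 4100.0, 3900.0, 4050.0, 3950.0],
    "Right-Hippocampus": [4200.0, 4200.0, 4200.0, 4200.0, 4200.0],
}

for roi, (m, s, cv) in variability(runs).items():
    print(f"{roi}: mean={m:.1f} std={s:.1f} cv={cv:.2f}%")
```

Reporting the coefficient of variation alongside the raw standard deviation makes it easier to compare structures of very different size, and to set the result against figures such as the 2-3% hippocampal variability mentioned above.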
Hi Bruce et al,
I may be late to the discussion, but I wanted to add my two cents, given that we've had some headaches trying to get identical results on presumptively identical systems for FreeSurfer and other tools. Ultimately, of course, the systems were not 100% identical, and the differences resulted in some downstream libraries using 32-bit code instead of 64-bit. Some differences had been observed in other software packages to a lesser extent, but I'm confident that had we compared any other software that relied on complex iterative algorithms using these libraries, the differences would have been more pronounced. Re-imaging the systems to ensure that all libraries were identical resulted in identical output. Then, when migrating from RHEL 4.x to 5.x to 6.x, we did some similar library checking and again were able to produce identical results.
It is important for the neuroimaging community to understand that reproducibility is in large part their responsibility and not that of the software package or operating system developers. It is effectively impossible to guarantee identical results on non-identical systems. Unfortunately, "identical" can for some analyses leave little room for differences (perhaps even the random seed differences just suggested by Michael Harms), but notwithstanding hardware-related interference, bugs, or errors, identical systems should generally produce identical results.
I forwarded the paper that sparked this thread in jest to friends saying "why can't things be simple?" but in reality, as Bruce mentioned, this is not in the least a surprise. I've been involved in testing output and fixing it between software and system updates in clinical and research settings since 1999, and I'm pretty sure I was not the first. There is a reason the FDA does not want you upgrading even Minesweeper on a validated Windows system without re-validating. Researchers need to think similarly (if not quite as extremely).
Some additional notes for those of you who may not be aware:
1. The USER environment can affect results. On GNU/Linux systems, for example, modifying the $PATH or $LD_LIBRARY_PATH variables may result in different output from the same executable on the same system by different users (or the same user under different shells). Mac and Windows can have similar issues, particularly where "power users" are concerned.
2. Statically compiling software does not eliminate the use of dynamically loaded libraries (see a good explanation at "Linking libstdc++ statically", http://www.trilithium.com/johan/2005/06/static-libstdc/). So even statically compiled software can be affected by other libraries on the system.
3. All other things being identical, using Intel vs. AMD x86 compatible chips should not affect the output; however, going to ARM, RISC, or GPU chips where floating point representations are IEEE compliant but different would virtually guarantee different results even if all libraries are the same for all but the simplest calculations. This means that unfortunately you'll likely never be able to reproduce your 1,000 node cluster results on your iPad -- no matter how cool or powerful it gets.
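One low-cost way to act on the notes above is to record the environment alongside each run, so that two runs can later be checked for having used the same search paths and architecture. The following is a rough sketch, not an established tool: the function names are made up, and the fields captured are only a starting point. Note that `pointer_bits` describes the Python interpreter running the script, not the analysis binaries themselves.

```python
import hashlib
import json
import os
import platform
import struct
import sys

def environment_snapshot(env=None):
    """Collect settings that can change which libraries an executable picks up."""
    env = os.environ if env is None else env

    def split_path(name):
        # Keep the order: earlier entries win when libraries are resolved.
        return [p for p in env.get(name, "").split(os.pathsep) if p]

    return {
        "platform": platform.platform(),
        "machine": platform.machine(),
        "pointer_bits": struct.calcsize("P") * 8,  # 32 or 64 for this interpreter
        "python": sys.version.split()[0],
        "PATH": split_path("PATH"),
        "LD_LIBRARY_PATH": split_path("LD_LIBRARY_PATH"),
    }

def fingerprint(snapshot):
    """Stable hash of a snapshot, convenient for comparing runs at a glance."""
    blob = json.dumps(snapshot, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

# Two hypothetical users on the same machine with different library paths.
user_a = environment_snapshot({"PATH": "/usr/bin", "LD_LIBRARY_PATH": "/opt/libA"})
user_b = environment_snapshot({"PATH": "/usr/bin", "LD_LIBRARY_PATH": "/opt/libB"})
print(fingerprint(user_a) == fingerprint(user_b))  # False: environments differ
```

Storing the snapshot (or just its fingerprint) next to the output makes it possible to tell afterwards whether two sets of results were really produced under the same conditions.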
All that being said, if you do run your entire experiment twice, using two different systems that differ only in their IEEE-compliant double precision floating point implementation, and the results are significantly different (e.g., running on a Xeon cluster the hippocampal volumes of group A and group B are different and running on an NVIDIA GPU cluster they are not), that would bring into question the validity/reliability of the analysis. I have not seen any evidence of that.
That may have been more than two cents.
-Gabriele
P.S. I'm *not* going to chime in on differences between versions, since I can't imagine how a segmentation algorithm (for example) would have gotten more accurate and yet have produced the same results.
p.s. I should add that we've known about this effect for a while (as the authors in the paper state), but haven't had the time to track it down. Since it's an avoidable source of variance (by controlling what computers you run the analysis on), it's lower on our list of priorities than other improvements. As far as the other part of the paper comparing different versions, I'll admit I'm puzzled as to why anyone would find it surprising that different versions yield different answers. Would you want an MRI analysis package to be the same in 2012 as in 1999? We're always trying to improve things - increase accuracy and remove unnecessary sources of variance - so inevitably different versions give different results.
In any case, it's nice to get some positive feedback :)
Bruce
Please post it on the PLoS ONE website. According to the policies of this journal, you are allowed to respond to each paper directly on the same page as the article. You don't need to submit it as a paper, and it won't go through a review process.
Peter,
Thanks for this info. The Mac OS versions compared in the paper were 10.5 and 10.6 (Leopard and Snow Leopard). One of the major changes between these two versions was the switch from a 32-bit kernel to a 64-bit kernel (whereas Snow Leopard and Lion both use a 64-bit kernel). I haven't been able to say definitively that the math libs changed, but it would seem to account for the differences in Mac OS version results in the paper. A 32-bit build of FreeSurfer (for the Mac) was used in both tests, so kernel differences might not matter but are suspect.
Nick