On Tue, Jul 3, 2012 at 5:56 PM, Akio Yamamoto <yamamoto@tkl.iis.u-tokyo.ac.jp> wrote:
Yes, as Richard pointed out, I just wanted to know the numbers to feed into Amdahl's law, if you already have any, so that I can estimate the maximum expected speedup from using multiple processors/cores.
As for improvements to em_reg, I'll try to split each transform as well as parallelize the energy evaluation.
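For reference, Amdahl's law is easy to evaluate once you have a measured parallel fraction. The sketch below is only an illustration: the function names and the fraction p are my own placeholders, not figures measured from em_reg.

```c
/* Amdahl's law: expected speedup on n cores when a fraction p
 * (0 <= p <= 1) of the serial runtime can be parallelised.
 * p is a placeholder here; it has to come from profiling the real code. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Upper bound on the speedup as the number of cores grows without limit.
 * Only meaningful for p < 1. */
double amdahl_limit(double p)
{
    return 1.0 / (1.0 - p);
}
```

With a purely illustrative p = 0.9, amdahl_speedup(0.9, 8) comes out at about 4.7, and amdahl_limit(0.9) at about 10, so the serial 10% caps the gain no matter how many cores are added.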
Parallelising the energy evaluation is the lowest-hanging fruit, and is what the 'slow' GPU version does. For the highest performance, though, I would convert the nested transform loops into a single loop and farm its iterations out across OpenMP threads. (I wouldn't bother with nested parallelism inside the energy evaluation, although you might want to do some SSE tinkering.) That is effectively what the 'fast' GPU version does, and you can use the same basic structure. Be aware, however, that the slightly different transforms which result can cause you to converge to a different solution.
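A minimal sketch of the flattened loop, to show the structure I mean. Everything here is hypothetical: energy() is a toy stand-in for the real evaluation, and the three grid dimensions na, nb, nc stand in for whatever the nested transform loops actually iterate over. It does not reproduce the real em_reg loops.

```c
#include <float.h>

/* Toy stand-in for the real energy evaluation (hypothetical).
 * Its minimum is at grid point (3, 1, 2). */
static double energy(int ia, int ib, int ic)
{
    double da = ia - 3.0, db = ib - 1.0, dc = ic - 2.0;
    return da * da + db * db + dc * dc;
}

/* Search an na x nb x nc grid of candidate transforms with one flat loop.
 * Returns the flat index of the lowest-energy candidate and stores its
 * energy in *best_energy (if non-NULL). */
long best_transform(int na, int nb, int nc, double *best_energy)
{
    const long total = (long)na * nb * nc;
    double best_e = DBL_MAX;
    long best_i = -1;

    #pragma omp parallel
    {
        /* Each thread tracks its own best candidate. */
        double my_e = DBL_MAX;
        long my_i = -1;

        #pragma omp for schedule(static) nowait
        for (long i = 0; i < total; ++i) {
            /* Decode the flat index back into the three loop counters. */
            int ic = (int)(i % nc);
            int ib = (int)((i / nc) % nb);
            int ia = (int)(i / ((long)nc * nb));
            double e = energy(ia, ib, ic);
            if (e < my_e) { my_e = e; my_i = i; }
        }

        /* Merge per-thread results; ties go to the lowest flat index
         * so the answer does not depend on thread scheduling. */
        #pragma omp critical
        {
            if (my_i >= 0 &&
                (my_e < best_e || (my_e == best_e && my_i < best_i))) {
                best_e = my_e;
                best_i = my_i;
            }
        }
    }

    if (best_energy)
        *best_energy = best_e;
    return best_i;
}
```

Build with OpenMP enabled (e.g. -fopenmp with GCC); without it the pragmas are ignored and the same code runs serially. If the loops are perfectly nested, the OpenMP collapse clause on the original nest is an alternative to flattening by hand.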
HTH,
Richard