Differences between revisions 2 and 9 (spanning 7 versions)
Revision 2 as of 2010-01-28 19:50:20
Size: 843
Comment:
Revision 9 as of 2010-02-04 12:29:12
Size: 2368
Editor: RichardEdgar
Comment:

Notes on FreeSurfer code optimization

This page is for free-form notes on ways to optimize the FreeSurfer code base, whether simple fixes or larger-scale problems.

Format is: name:, <short label> - description

* nick:, -ffast-math - Try the -ffast-math flag of gcc v4.x. Prior experiments with this on the AMD compiler showed output differences in recon-all, but perhaps selective use of this flag is possible.

* nick:, SSE math lib - Replace instances of sin, cos, log and exp with routines optimized for the SSE instructions found on Intel processors. See http://gruntthepeon.free.fr/ssemath/. Tried this, but ran into problems and gave up. With some wrangling, it could be made an optional part of the build via #ifdefs. Actually, it's not a lib but a header file.

* Richard:, MRI data structure - The pointer-chasing implied by ***slices is horrific for the CPU caches (and a non-starter on the GPU). The 'chunking' alternative is much better, but needs to be used uniformly.

* Richard:, MATRIX data type - If this is only used for 4x4 affine transformations, it should be coded as such. If it is also used more generally, then a separate 'Affine' class should be considered (this exists for the GPU already in the file affinegpu.cu).

* Richard:, Boundary conditions - MRIconvolve1d and MRImean handle out-of-range accesses differently. MRImean effectively returns zero, whereas MRIconvolve1d uses the [x|y|z]i pointers, which clamp to the edge of the range. There are probably other places where this happens. A uniform treatment would be best.

* Richard:, Data structure memory management - Data structures which allocate RAM, or which are allocated as arrays, should always carry their lengths with them. I'm thinking particularly of GCA_SAMPLE arrays here, but I imagine there are other examples.

* Richard:, MRI data structure - Do we really need support for all of UCHAR, SHORT, LONG and FLOAT? On the GPU, manipulating data types which aren't 4-byte aligned is slow, and I imagine that modern CPUs face similar difficulties. The smaller types do save some RAM, but only by a factor of four; if we want to edit bigger volumes or long sequences, we should be thinking about better data structures, not trying to 'cram down' the existing ones.
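
As a rough illustration of three of Richard's notes together (contiguous storage instead of ***slices, structures that carry their own lengths, and one uniform boundary rule), here is a minimal C sketch. It is not FreeSurfer code: FlatVolume and every flat_volume_* function are hypothetical names invented for this example, and clamping to the edge is chosen only because it matches what the notes say MRIconvolve1d does.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical type (not the FreeSurfer MRI struct): one contiguous
 * buffer of floats plus the dimensions needed to index it safely. */
typedef struct {
    int width, height, depth;  /* the lengths travel with the data */
    float *data;               /* width*height*depth voxels, x fastest */
} FlatVolume;

/* Allocate a zero-filled volume; returns NULL on bad sizes or OOM. */
FlatVolume *flat_volume_alloc(int width, int height, int depth)
{
    if (width <= 0 || height <= 0 || depth <= 0) return NULL;
    FlatVolume *v = malloc(sizeof *v);
    if (v == NULL) return NULL;
    v->data = calloc((size_t)width * (size_t)height * (size_t)depth,
                     sizeof *v->data);
    if (v->data == NULL) { free(v); return NULL; }
    v->width = width;
    v->height = height;
    v->depth = depth;
    return v;
}

void flat_volume_free(FlatVolume *v)
{
    if (v != NULL) { free(v->data); free(v); }
}

/* Linear offset of voxel (x, y, z): a single multiply-add chain,
 * with no intermediate pointer loads as in a float*** layout. */
size_t flat_volume_offset(const FlatVolume *v, int x, int y, int z)
{
    return ((size_t)z * (size_t)v->height + (size_t)y) * (size_t)v->width
           + (size_t)x;
}

/* Clamp index i into [0, n-1]. */
int flat_volume_clamp(int i, int n)
{
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

/* Single boundary policy for all readers: out-of-range coordinates
 * are clamped to the nearest edge voxel. */
float flat_volume_get_clamped(const FlatVolume *v, int x, int y, int z)
{
    x = flat_volume_clamp(x, v->width);
    y = flat_volume_clamp(y, v->height);
    z = flat_volume_clamp(z, v->depth);
    return v->data[flat_volume_offset(v, x, y, z)];
}

/* Writes must be in range; this is checked only in debug builds. */
void flat_volume_set(FlatVolume *v, int x, int y, int z, float value)
{
    assert(x >= 0 && x < v->width);
    assert(y >= 0 && y < v->height);
    assert(z >= 0 && z < v->depth);
    v->data[flat_volume_offset(v, x, y, z)] = value;
}
```

If every filter read voxels through something like flat_volume_get_clamped, the MRImean and MRIconvolve1d discrepancy could not arise, because the boundary rule would live in one place. Returning zero outside the volume would be an equally valid single policy. The point of the sketch is that the choice is made once, not which choice is right.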

OptimizationNotes (last edited 2021-09-22 11:51:29 by DevaniCordero)