|
Size: 1510
Comment:
|
Size: 2992
Comment:
|
| Deletions are marked like this. | Additions are marked like this. |
| Line 4: | Line 4: |
| mris_fix_topology has a hot loop, which does unnecessary calculations of the face normals |
==== mris_fix_topology has a hot loop, which does unnecessary calculations of the face normals ==== |
| Line 8: | Line 7: |
| _:_: then it loops over a set of patches, or iterates on one patch. For each patch it calls | _:_: it loops over a set of patches, or iterates on one patch. For each patch it calls |
| Line 20: | Line 19: |
| To simplify, the original code does this | ==== To simplify, the original code does this ==== _: loop |
| Line 22: | Line 22: |
| loop | _:_: for all fno normal[fno] = f (inputs) |
| Line 24: | Line 24: |
| for all fno normal[fno] = f (inputs) | _:_: for a few fno use normal[fno] |
| Line 26: | Line 26: |
| for a few fno use normal[fno] | _: end loop |
| Line 28: | Line 28: |
| change some inputs | _: change some inputs |
| Line 30: | Line 30: |
| end loop | _: use some normal[fno] |
| Line 32: | Line 32: |
| change some inputs | ''Since only a few of the face normals are used, it is a waste of time to calculate all of them every time around the loop! '' |
| Line 34: | Line 34: |
| use some normal[fno] | ==== This is replaced by code that does ==== _: loop |
| Line 36: | Line 37: |
| Since only a few of the face normals are used, it is a waste of time to calculate all of them every time around the loop! | _:_: for all fno: normal[fno].deferred = true |
| Line 38: | Line 39: |
| This is replaced by code that does | _:_: for a few fno: use normal[fno], computing it from f (inputs) if is deferred, and changing it to being no-longer deferred |
| Line 40: | Line 41: |
| loop | _: end loop |
| Line 42: | Line 43: |
| for all fno __normal[fno].deferred = true__ | _: compute any remaining deferred normal[fno] before their inputs can change |
| Line 44: | Line 45: |
| for a few fno __if normal[fno].deferred normal[fno] = f (inputs); normal[fno].deferred = false__ | _: change some inputs |
| Line 46: | Line 47: |
| change some inputs | _: use some normal[fno] |
| Line 48: | Line 49: |
| end loop | ==== How can we test it gets the right answers? ==== When checking code is enabled, the normal[fno] is computed from the inputs twice - once when it is deferred, and again when it is undeferred. It is checked that these two values are the same. |
| Line 50: | Line 52: |
| __for a few fno if normal[fno].deferred normal[fno] = f (inputs); normal[fno].deferred = false__ | ==== How do we make sure that the undeferring is done? ==== If we left the __FACE nx,ny,nz,orig_area__ members unchanged, then it would be hard to find all the users. We could just rename them, but it is easier to place them in a struct and have a function called to get a pointer to the struct, and have the function do the undeferral. This is a standard data hiding technique, implemented here in C. Now all the places that need to be changed are compilation errors. |
| Line 52: | Line 55: |
| change some inputs | '''Why put the structs into a separate vector?''' |
| Line 54: | Line 57: |
| use some normal[fno] | If we coded this as __MRIS__ has a vector of __FACE__ which has one __FaceNormCacheEntry__ member, then the loop that sets deferred = true would bring in one fno per cache line fetch. Since there are about 100,000 faces, this would bring in about 100,000 cache lines. If instead we code this as MRIS has a vector of __FaceNormCacheEntry,__ then the same loop will bring in about 3 fno per cache line, which will run about three times faster. |
| Line 56: | Line 60: |
| It would be even faster if we separated the __deferred__ flags from the __nx,ny,nz,orig_area__ then the loop would go even faster - 64 fno per cache line - but every other use of them would require 2 cache line fetches. I did not measure to see whether this would be faster, because the loop was no longer important. |
A variety of techniques have been used.
Deferring calculations until needed
mris_fix_topology has a hot loop, which does unnecessary calculations of the face normals
mrisComputeOptimalRetessellation and mrisComputeRandomRetessellation have a similar structure
_:_: it loops over a set of patches, or iterates on one patch. For each patch it calls
_:_:_: mrisDefectPatchFitness, which calls
_:_:_:_: mrisComputeDefectLogLikelihood, which calls
_:_:_:_:_: mrisComputeDefectMRILogUnlikelihood, which
_:_:_:_:_:_: does an expensive computation all the face normals for ALL the faces
_:_:_:_:_:_: does two other expensive steps, which only use a few of the face normals
To simplify, the original code does this
_: loop
_:_: for all fno normal[fno] = f (inputs)
_:_: for a few fno use normal[fno]
_: end loop
_: change some inputs
_: use some normal[fno]
Since only a few of the face normals are used, it is a waste of time to calculate all of them every time around the loop!
This is replaced by code that does
_: loop
_:_: for all fno: normal[fno].deferred = true
_:_: for a few fno: use normal[fno], computing it from f (inputs) if is deferred, and changing it to being no-longer deferred
_: end loop
_: compute any remaining deferred normal[fno] before their inputs can change
_: change some inputs
_: use some normal[fno]
How can we test it gets the right answers?
When checking code is enabled, the normal[fno] is computed from the inputs twice - once when it is deferred, and again when it is undeferred. It is checked that these two values are the same.
How do we make sure that the undeferring is done?
If we left the FACE nx,ny,nz,orig_area members unchanged, then it would be hard to find all the users. We could just rename them, but it is easier to place them in a struct and have a function called to get a pointer to the struct, and have the function do the undeferral. This is a standard data hiding technique, implemented here in C. Now all the places that need to be changed are compilation errors.
Why put the structs into a separate vector?
If we coded this as MRIS has a vector of FACE which has one FaceNormCacheEntry member, then the loop that sets deferred = true would bring in one fno per cache line fetch. Since there are about 100,000 faces, this would bring in about 100,000 cache lines. If instead we code this as MRIS has a vector of FaceNormCacheEntry, then the same loop will bring in about 3 fno per cache line, which will run about three times faster.
It would be even faster if we separated the deferred flags from the nx,ny,nz,orig_area then the loop would go even faster - 64 fno per cache line - but every other use of them would require 2 cache line fetches. I did not measure to see whether this would be faster, because the loop was no longer important.
