Ways to reduce serial code elapsed time
Many books, papers, and courses are available about this topic. Here is a very brief overview, with descriptions of the techniques being applied to Freesurfer.
The execution of the program requires fetching both instructions and data from memory, via a multi-layered cache, executing the instructions, and storing the results. Getting the data from the memory and distant levels of the cache can be between tens and thousands of times slower than the instruction execution.
To reduce the execution time, find ways of removing data movement and instruction execution. Data movement is reduced by reordering the data placement to pack the 64 byte cache line fetches with useful data, and by reordering the instructions so that fetched data is reused while it is available rather than having to refetch it. Instruction execution is removed by looking for ways to eliminate large numbers of operations, and then by looking more closely at just the loops in the code that take the most time.
Using cheaper operations
Square root and trig functions can be surprisingly expensive when used in hot loops.
'sqrt' can often be replaced by observing that 'sqrt(a) < sqrt(b)' is the same as 'a < b', so rather than choosing the shortest length of a line (length requiring the calculation of sqrt), calculate the shortest length-squared and then if really needed, take the sqrt only of the shortest length.
Trig functions can sometimes be replaced by just the first few terms of their taylor series. Near zero, sin(x) is x-x**3/6, cos(x) is 1-x**2/2. asin and acos and atan have similar simple approximations.
Sometimes it is better to defer calculations until the result is definitely needed
It is quick and simple to write code that says
- Compute some properties of all vertexs into per-vertex data member
- Loop over some subset of the vertexs, using those properties
but this approach is only efficient if most of the data computed in the first loop is used in the second.
If the second loop does not use all the data, it may be best to instead defer the calculation until it is needed. This risks redoing the calculation, which could also be expensive, so it may be necessary to add a cache. Adding the cache leads to a problem determining when to invalidate the cache, and how to quickly invalidate many entries in the cache, and how to build a thread-safe cache. It is not surprising that, for all but the hot code, the quick and simple approach is often best.
For an example of building a deferral mechanism, see
Cheaper matrix operations
The code has several different implementations of matrices and matrix operations, coming from varying sources. However almost all the matrices used are 1x3 or 3x3 or, sometimes 1x4, and 4x4. It is rare to exceed this. However almost all the code is written using generalized algorithms that do not know this.
Some measurements were suggesting that the locking caused by using malloc and free were contributing the to serializing of parallel code, so I made changes to eliminate these mallocs and frees, as described here.
Using specialized algorithms requiring auxiliary data structures
The code has the essential large data structures of the MRIS with its arrays of VERTEX and FACE, and algorithms can often be thought of, and coded, as manipulations of these. Sometimes the algorithm repeatedly makes a decision about the same entities in this data structure, and time can be saved by making it possible to ignore many of them or to visit them just once, rather than repeatedly, during the execution.
Using hash tables to accelerate searches
mrisReadSTLfile is an example of a using a hash table to avoid repeatedly searching many objects for ones that match some property.
Exploiting the spatial nature of Freesurfer's data
Our data represents a physical reality where only close things interact with each other. The vertexs and faces etc. are numbered, but sadly the physical closeness is not fully captured by the numbering, because of the folding of the surface. However by partitioning the vertexs etc. by their physical location, some of the hottest loops can have many of their iterations dropped from consideration.
There are several places in the code where vertexs are moved to improve some measurement, but their movement must be constrained by not introducing intersections.
Rather than testing one entity involved in the intersection against every other entity, several techniques are used to reduce possible suspects
- mrishash.c maps every face into a 3D voxellation of the space. It knows that the faces will not intersect if their voxellations don't. This gives good results but both the insertion and lookup are quite expensive
- realm.c creates a 3D tree of the space, maps each vertex into one node in the tree, and maps each face into the common ancestor of the vertex nodes. It knows that faces will not intersect if they don't share subtrees. This approach has fast insertion, removal, and lookup. The structure has to be maintained as the defect elimination code adds and removes entities as well as moving them.
realm.c also optimizes the determination of intersecting great arcs during defect repair
Making such algorithms possible
Manipulations of related fields should be done in only a few places
Consider three of MRIS's members- max_vertices, nvertices, vertices
There are requirements that the rest of the code relies on
nvertices <= max_vertices
max_vertices must be the number of elements vertices points to
if the max_vertices is increased, the vertices must be realloc'ed
if the vertices is increased, the new VERTEX elements must be bzero'ed
Maintaining theses rule is difficult when many places in the code are assign to nvertices.
To stop this from happening, I changed nvertices to a const int, causing all the assignments to become error messages. Then I wrote a small set of functions for manipulating these three values, and editted all the errors to use these functions.
The same problem and solution was adopted with nfaces. Here is was harder and more important because there are multiple vectors that must be nfaces long, and the original code was changing nfaces in several places without updating all of the relevant vectors.
To compute or to cache... that is the question
The gcas->prior member illustrates this tradeoff beautifully. Almost all the hot uses of it needed log(gcas->prior), and the call to the log was hot, so the member was renamed and the new code sets it via a call to gcas_setPrior(gcas,prior). This call computes and saves both the prior and its log.
Calls to get the log via gcas_getPriorLog can then fetch the precomputed value.
Since setting the log did not show up as a hot spot, and since it appeared at least at a cursory examination that all the logs were repeatedly used, I did not investigate adding a full cache mechanism.
See the full pull request at https://github.com/freesurfer/freesurfer/pull/216/files