Parent: MorphoOptimizationProject

Ways to reduce serial code elapsed time

Background

Many books, papers, and courses are available about this topic. Here is a very brief overview, with descriptions of the techniques being applied to Freesurfer.

The execution of the program requires fetching both instructions and data from memory, via a multi-layered cache, executing the instructions, and storing the results. Fetching data from main memory or the more distant levels of the cache can be tens to thousands of times slower than executing the instructions that use it.

To reduce execution time, look for ways to eliminate both data movement and instruction execution. Data movement is reduced by laying out the data so that each 64-byte cache-line fetch is packed with useful data, and by reordering the instructions so that fetched data is reused while it is still in the cache rather than refetched later. Instruction execution is reduced by first looking for ways to eliminate large numbers of operations, and then by looking more closely at just the loops in the code that take the most time.

In short: use fewer operations or use cheaper operations.

However, it is important to remember that a computer has several distinct subsystems:

  1. Cores doing integer operations
  2. Floating-point units doing cheap operations such as add, sub, and mul
  3. Floating-point units doing expensive operations such as div and sqrt
  4. A hierarchy of memory caches, connected by memory buses - see https://software.intel.com/en-us/articles/why-efficient-use-of-the-memory-subsystem-is-critical-to-performance

One needs to determine which of these subsystems is the limiting factor on the performance of each portion of the program.

As described in MorphoOptimizationProject_improvingParallelism, there are several ways to measure the performance, but Intel's VTune tool provides insights that are hard to get any other way. It is always best to measure: to find the places in the code that are truly hot, to find the real cause of the heat, and to guide the fix.

Using fewer operations

Sometimes it is better to defer calculations until the result is definitely needed

It is quick and simple to write code that says

  1. Compute some property of every vertex into a per-vertex data member
  2. Loop over some subset of the vertices, using those properties

but this approach is only efficient if most of the data computed in the first loop is used in the second.

If the second loop does not use all the data, it may be better to defer the calculation until the value is needed. Deferral risks redoing the calculation, which can itself be expensive, so it may be necessary to add a cache. Adding a cache leads to the problems of deciding when to invalidate it, of invalidating many entries quickly, and of making it thread-safe. It is not surprising that, for all but the hot code, the quick and simple approach is often best.
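
Here is a minimal sketch of the idea in C, using hypothetical names (the deferral mechanisms in the real code are more elaborate): a per-vertex value is computed only on first use, cached in the vertex, and invalidated whenever its inputs change.

    #include <stdbool.h>

    typedef struct Vertex {
        float x, y, z;
        float curvature;       /* expensive derived property */
        bool  curvatureValid;  /* false until computed, cleared when the vertex moves */
    } Vertex;

    /* Placeholder standing in for the real, expensive, computation */
    static float computeCurvature(Vertex const * v) {
        return v->x * v->y + v->z;
    }

    /* Defer the work: compute only when first asked for, then reuse */
    static float getCurvature(Vertex * v) {
        if (!v->curvatureValid) {            /* note: not thread-safe as written */
            v->curvature      = computeCurvature(v);
            v->curvatureValid = true;
        }
        return v->curvature;
    }

    /* Every change to the inputs must invalidate the cached value */
    static void moveVertex(Vertex * v, float x, float y, float z) {
        v->x = x;  v->y = y;  v->z = z;
        v->curvatureValid = false;
    }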

For an example of building a deferral mechanism, see

Cheaper matrix operations

The code has several different implementations of matrices and matrix operations, coming from varying sources. Almost all the matrices used are 1x3 or 3x3, or occasionally 1x4 and 4x4; it is rare to exceed this. Yet almost all the code is written using generalized algorithms that do not know this.

Some measurements suggested that the locking caused by using malloc and free was contributing to the serialization of parallel code, so I made changes to eliminate these mallocs and frees, as described here.
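
To illustrate the direction of the change (a sketch with illustrative names, not the actual replacement code): a fixed-size type for the ubiquitous 3x3 case can live on the stack, so an operation such as multiply involves no malloc, no free, and hence no locking.

    /* A general-purpose matrix library typically mallocs the result,
     * which the caller must later free.  A fixed-size value type
     * needs neither. */
    typedef struct { float m[3][3]; } Matrix3x3;

    static Matrix3x3 mul3x3(Matrix3x3 const * a, Matrix3x3 const * b) {
        Matrix3x3 r;
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                float s = 0.0f;
                for (int k = 0; k < 3; k++) s += a->m[i][k] * b->m[k][j];
                r.m[i][j] = s;
            }
        }
        return r;   /* returned by value - lives on the stack */
    }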

Cheaper sorting

qsort is used extensively in our code, and it is an excellent sorting algorithm, but it has two significant penalties that may make it worth replacing in hot code:

  1. It uses a call-back to do the comparison
  2. It does not use parallelism

There was one place in the serial code for mris_fix_topology where sorting was important, and so I added include/sort.h to provide a sort that doesn't have these handicaps.
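
To illustrate the first penalty (this is a sketch, not the contents of include/sort.h): qsort must make an indirect call through the comparison function pointer for every comparison, whereas a type-specialized sort lets the compiler inline the comparison.

    #include <stdlib.h>
    #include <string.h>

    /* qsort calls this through a function pointer for every comparison,
     * used as: qsort(v, n, sizeof *v, compareFloats); */
    static int compareFloats(const void * a, const void * b) {
        float fa = *(const float *)a, fb = *(const float *)b;
        return (fa > fb) - (fa < fb);
    }

    /* A type-specialized merge sort: the 'v[j] < v[i]' comparison is
     * inlined.  tmp must have room for n floats. */
    static void sortFloats(float * v, float * tmp, int n) {
        if (n < 2) return;
        int h = n / 2;
        sortFloats(v,     tmp, h);       /* these two calls touch disjoint */
        sortFloats(v + h, tmp, n - h);   /* halves of the array            */
        int i = 0, j = h, k = 0;
        while (i < h && j < n) tmp[k++] = (v[j] < v[i]) ? v[j++] : v[i++];
        while (i < h) tmp[k++] = v[i++];
        while (j < n) tmp[k++] = v[j++];
        memcpy(v, tmp, n * sizeof(float));
    }

Because the two recursive calls work on disjoint halves, they can also be given to different threads (for example with OpenMP tasks), which addresses the second penalty.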

Using specialized algorithms requiring auxiliary data structures

The code has the essential large data structures of the MRIS, with its arrays of VERTEX and FACE, and algorithms can often be thought of, and coded, as manipulations of these. Sometimes an algorithm repeatedly makes decisions about the same entities in these structures, and time can be saved by making it possible to ignore many of them, or to visit each of them just once rather than repeatedly, during the execution.

Using hash tables to accelerate searches

mrisReadSTLfile is an example of using a hash table to avoid repeatedly searching many objects for ones that match some property.

See https://github.com/freesurfer/freesurfer/pull/196
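
The pattern looks roughly like this (a sketch with illustrative names and hashing; STL files repeat the coordinates of shared corners for every triangle, so the reader must find the vertex a position was already assigned to, or create a new one):

    #include <stdlib.h>
    #include <string.h>

    typedef struct Entry { float xyz[3]; int vertexNo; struct Entry * next; } Entry;

    #define NBUCKETS (1 << 16)
    static Entry * buckets[NBUCKETS];

    /* FNV-1a over the 12 bytes of the coordinates: vertices at the same
     * position land in the same bucket */
    static unsigned hashXYZ(float const xyz[3]) {
        unsigned h = 2166136261u;
        unsigned char const * p = (unsigned char const *)xyz;
        for (int i = 0; i < 12; i++) { h ^= p[i]; h *= 16777619u; }
        return h & (NBUCKETS - 1);
    }

    /* O(1) expected, instead of a scan over every vertex read so far */
    static int findOrAddVertex(float const xyz[3], int * nextVertexNo) {
        Entry ** bp = &buckets[hashXYZ(xyz)];
        for (Entry * e = *bp; e; e = e->next)
            if (memcmp(e->xyz, xyz, sizeof e->xyz) == 0) return e->vertexNo;
        Entry * e = malloc(sizeof *e);
        if (!e) abort();
        memcpy(e->xyz, xyz, sizeof e->xyz);
        e->vertexNo = (*nextVertexNo)++;
        e->next = *bp;  *bp = e;
        return e->vertexNo;
    }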

Exploiting the spatial nature of Freesurfer's data

Our data represents a physical reality where only close things interact with each other. The vertices, faces, and other entities are numbered, but sadly physical closeness is not fully captured by the numbering, because of the folding of the surface. However, by partitioning these entities by their physical location, some of the hottest loops can have many of their iterations dropped from consideration.

There are several places in the code where vertices are moved to improve some measurement, but their movement is constrained to avoid introducing intersections.

Rather than testing one entity involved in a possible intersection against every other entity, several techniques are used to reduce the set of suspects:

  1. mrishash.c maps every face into a 3D voxellation of the space. It knows that two faces cannot intersect if their voxellations don't overlap. This gives good results, but both the insertion and lookup are quite expensive (see the sketch after this list).
  2. realm.c creates a 3D tree of the space, maps each vertex into one node in the tree, and maps each face into the common ancestor of the vertex nodes. It knows that faces will not intersect if they don't share subtrees. This approach has fast insertion, removal, and lookup. The structure has to be maintained as the defect elimination code adds and removes entities as well as moving them.
  3. realm.c also optimizes the determination of intersecting great arcs during defect repair
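
A minimal sketch of the voxel idea from the first technique, using illustrative names: map each face's bounding box onto a coarse grid, and reject pairs whose voxel ranges do not overlap using only integer compares.

    #include <math.h>

    typedef struct { int lo[3], hi[3]; } VoxelRange;

    static int toVoxel(float x, float origin, float voxelSize) {
        return (int)floorf((x - origin) / voxelSize);
    }

    /* The voxel range covered by a face's axis-aligned bounding box */
    static VoxelRange faceVoxelRange(float const minXYZ[3], float const maxXYZ[3],
                                     float const origin[3], float voxelSize) {
        VoxelRange r;
        for (int i = 0; i < 3; i++) {
            r.lo[i] = toVoxel(minXYZ[i], origin[i], voxelSize);
            r.hi[i] = toVoxel(maxXYZ[i], origin[i], voxelSize);
        }
        return r;
    }

    /* Two faces whose voxel ranges do not overlap cannot intersect;
     * only the pairs that pass this need the expensive exact test */
    static int mightIntersect(VoxelRange const * a, VoxelRange const * b) {
        for (int i = 0; i < 3; i++)
            if (a->hi[i] < b->lo[i] || b->hi[i] < a->lo[i]) return 0;
        return 1;
    }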

The spatial nature of the data also makes it possible to do quick checks that filter out some candidates before doing the expensive exact checks; see, for instance, MorphoOptimizationProject_distanceToNearestFace.
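
One such quick check, sketched here with assumed names: a lower bound computed from a face's bounding sphere can show that the face cannot possibly beat the best distance found so far, so the exact point-to-face computation is skipped.

    #include <math.h>

    typedef struct { float cx, cy, cz, radius; } BoundingSphere;

    /* Cheap filter before the expensive exact point-to-face distance:
     * the distance from the point to anything inside the sphere is at
     * least (distance to center) - radius */
    static int worthExactTest(BoundingSphere const * s,
                              float px, float py, float pz,
                              float bestDistance) {
        float dx = px - s->cx, dy = py - s->cy, dz = pz - s->cz;
        float lowerBound = sqrtf(dx*dx + dy*dy + dz*dz) - s->radius;
        return lowerBound < bestDistance;
    }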

Using cheaper operations

Transcendental Functions

Square root and trig functions can be surprisingly expensive when used in hot loops.

sqrt can often be avoided by observing that sqrt(a) < sqrt(b) is the same as a < b for non-negative values. So rather than choosing the shortest length of a line (where each length requires a sqrt), find the shortest length-squared, and then, if it is really needed, take the sqrt of only that shortest length.
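
A sketch of the pattern:

    #include <math.h>

    typedef struct { float x, y, z; } Point;

    /* Squared distance - no sqrt */
    static float dist2(Point a, Point b) {
        float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        return dx*dx + dy*dy + dz*dz;
    }

    /* Compare squared distances, and take the single sqrt only once
     * the winner is known: one sqrt instead of n */
    static float nearestDistance(Point p, Point const * v, int n) {
        float best2 = dist2(p, v[0]);
        for (int i = 1; i < n; i++) {
            float d2 = dist2(p, v[i]);
            if (d2 < best2) best2 = d2;
        }
        return sqrtf(best2);
    }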

Trig functions can sometimes be replaced by just the first few terms of their Taylor series. Near zero, sin(x) is approximately x - x^3/6 and cos(x) is approximately 1 - x^2/2; asin, acos, and atan have similarly simple approximations.
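
For instance (valid only near zero - the caller must check the input range before substituting these for the library functions):

    /* Near zero the truncation error is about x^5/120 for sin and
     * x^4/24 for cos, so for |x| < 0.1 both are accurate to within a
     * few parts per million. */
    static inline float sinNearZero(float x) { return x - x*x*x/6.0f; }
    static inline float cosNearZero(float x) { return 1.0f - x*x/2.0f; }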

Memory traffic

See MorphoOptimizationProject_ReducingMemoryTraffic.

Making such algorithms possible

Manipulations of related fields should be done in only a few places

Consider three of MRIS's members: max_vertices, nvertices, and vertices.

There are requirements that the rest of the code relies on: nvertices must never exceed max_vertices, and the vertices array must stay allocated with room for max_vertices entries.

Maintaining these rules is difficult when many places in the code assign to nvertices.

To stop this from happening, I changed nvertices to a const int, causing all the assignments to become compile-time errors. Then I wrote a small set of functions for manipulating these three values, and edited all the erroring assignments to use these functions.
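
The shape of the change, with illustrative names (the real MRIS and its helper functions differ):

    #include <stdlib.h>

    typedef struct { float x, y, z; } VERTEX_example;   /* placeholder */

    typedef struct {
        int const max_vertices;    /* capacity of the vertices array */
        int const nvertices;       /* entries currently in use       */
        VERTEX_example * vertices;
    } MRIS_example;

    /* The one place allowed to write nvertices: it checks the invariant
     * and confines the cast-away-const trick to a single function */
    static void growNVertices(MRIS_example * mris, int newNVertices) {
        if (newNVertices > mris->max_vertices) abort();
        *(int *)&mris->nvertices = newNVertices;
    }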

The same problem and solution applied to nfaces. Here it was harder and more important, because there are multiple vectors that must be nfaces long, and the original code was changing nfaces in several places without updating all of the relevant vectors.

To compute or to cache... that is the question

The gcas->prior member illustrates this tradeoff beautifully. Almost all the hot uses of it needed log(gcas->prior), and the calls to log were themselves hot, so the member was renamed and the new code sets it via a call to gcas_setPrior(gcas, prior). This call computes and saves both the prior and its log.

Calls to get the log via gcas_getPriorLog can then fetch the precomputed value.
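
A sketch of the shape of the change (the struct and field names here are simplified stand-ins for the real code):

    #include <math.h>

    typedef struct {
        float prior;
        float priorLog;    /* cached log(prior), set alongside prior */
    } GCAS_example;

    /* Pay for the log once, at set time */
    static void gcas_setPrior(GCAS_example * gcas, float prior) {
        gcas->prior    = prior;
        gcas->priorLog = logf(prior);
    }

    /* The hot readers just fetch the precomputed value */
    static float gcas_getPriorLog(GCAS_example const * gcas) {
        return gcas->priorLog;
    }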

Since setting the log did not show up as a hot spot, and since a cursory examination suggested that all the logs were used repeatedly, I did not investigate adding a full cache mechanism.

See the full pull request at https://github.com/freesurfer/freesurfer/pull/216/files