Our algorithms deal with a 3D volume or surface. I am going to concentrate on the 3D mesh of vertices, faces, and edges for this discussion.
Consider the problem of finding a maximum in a list of values. Serial code scanning from one end will always find either the first or the last of several equal maxima, and will get the same answer every time. Parallel code may find any of the maxima, and may find a different one on every run.
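For illustration, here is a minimal standalone sketch (not FreeSurfer code) of a parallel maximum search whose tie-breaking depends on which thread reaches the combining step first:

```c
/* The array has several equal maxima; serial code always reports the
 * first one, but the index reported here depends on which thread wins
 * the race to the critical section. */
#include <stdio.h>

int main(void) {
    float v[8] = {1.0f, 5.0f, 2.0f, 5.0f, 3.0f, 5.0f, 0.0f, 4.0f};
    float best = v[0];
    int   bestIndex = 0;

    #pragma omp parallel
    {
        float myBest = v[0];
        int   myBestIndex = 0;
        #pragma omp for nowait
        for (int i = 1; i < 8; i++) {
            if (v[i] > myBest) { myBest = v[i]; myBestIndex = i; }
        }
        #pragma omp critical
        {
            /* ">" keeps the first arriving thread's answer on ties,
             * and which thread arrives first varies from run to run */
            if (myBest > best) { best = myBest; bestIndex = myBestIndex; }
        }
    }
    printf("max %g at index %d\n", best, bestIndex);  /* may be 1, 3, or 5 */
    return 0;
}
```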
A similar problem is calculating an average. Floating point addition is not associative: (A+B)+C might differ slightly from A+(B+C). In serial code, the exact order of evaluation can differ depending on compiler optimizations. In parallel code it may also differ depending on the number of threads, and on the assignment of loop iterations to threads by the OpenMP implementation.
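A two-line demonstration of the non-associativity:

```c
/* Summing the same three values in a different order gives a
 * different result, because the intermediate rounding differs. */
#include <stdio.h>

int main(void) {
    float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("(a+b)+c = %g\n", (a + b) + c);  /* prints 1 */
    printf("a+(b+c) = %g\n", a + (b + c));  /* prints 0: b+c rounds to b */
    return 0;
}
```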
A surface folding algorithm calculates a value for each point on a surface using floating point operations, and one of several almost identical values will be chosen as the fold point, so very small differences in the arithmetic can produce different answers.
Reproducible OMP programming (ROMP)
With this in mind, each parallel loop can be placed in one of these categories:
- it is being experimented on, to see if it supports parallel execution
- it is known to get slightly, but acceptably, different answers
- it is thought to get exactly the same answer regardless of the number of threads
- it has been verified, by eye or by checking the code, to get exactly the same answer
The romp_support.h header allows every parallel loop to be annotated with its category, and execution of the less-trusted categories can then be suppressed when reproducibility is needed.
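The real macro and level names live in romp_support.h and are not reproduced here; the following sketch, with hypothetical SKETCH_* names, shows the idea of annotating a loop with its category and suppressing the less-trusted ones:

```c
/* A sketch of the annotation idea only; see romp_support.h for the
 * real macros. Each loop declares its reproducibility level, and a
 * runtime threshold suppresses the less-trusted levels when
 * reproducible runs are required. All names here are hypothetical. */
typedef enum {
    SKETCH_level_experimental,          /* being tried out                */
    SKETCH_level_acceptably_different,  /* known to differ, acceptably    */
    SKETCH_level_assume_reproducible,   /* thought to be exactly the same */
    SKETCH_level_shown_reproducible     /* verified by eye or code review */
} Sketch_level;

/* raised (e.g. from the environment) when reproducibility matters */
static Sketch_level sketch_threshold = SKETCH_level_experimental;

/* run the loop in parallel only if its level is trusted enough */
#define SKETCH_RUN_PARALLEL(level) ((level) >= sketch_threshold)

void example(float* v, int n) {
    if (SKETCH_RUN_PARALLEL(SKETCH_level_assume_reproducible)) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) v[i] *= 2.0f;
    } else {
        for (int i = 0; i < n; i++) v[i] *= 2.0f;
    }
}
```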
Reductions and similar sums
Reductions by summation of floating point values may produce different results from one execution of the parallel loop to another, especially if dynamic scheduling is used, or if a different number of threads is used.
To avoid this, the romp_for*.h files support decomposing the loop into the same chunks regardless of the number of threads, and summing the reductions in a stable order. The use of .h files rather than macros leads to a better debugging experience.
The technique is:
- determine the number of iterations needed
- deterministically partition them into a few hundred partitions (the ROMP_Distributor struct and ROMP_Distributor_begin do this)
- create temporaries for each partition, used for reducing only that partition's elements
- use an omp for loop to iterate over the partitions, doing any sums into that partition's temporaries
- do a serial loop combining the temporaries into the final destination
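The following is a minimal sketch of the technique, with hypothetical names (stableSum, NPARTITIONS) standing in for the ROMP_Distributor machinery in the romp_for*.h files:

```c
/* Sketch only: the iteration space is split into a fixed number of
 * partitions independent of the thread count, each partition reduces
 * into its own temporary, and a serial loop combines the temporaries
 * in a fixed order. */
#include <stddef.h>

#define NPARTITIONS 256  /* fixed, regardless of the thread count */

double stableSum(const float* v, size_t n) {
    double partial[NPARTITIONS] = {0.0};

    /* partition boundaries depend only on n, so each partial sum is
     * accumulated in the same order on every run */
    #pragma omp parallel for schedule(dynamic)
    for (int p = 0; p < NPARTITIONS; p++) {
        size_t lo = n *  (size_t)p      / NPARTITIONS;
        size_t hi = n * ((size_t)p + 1) / NPARTITIONS;
        double sum = 0.0;
        for (size_t i = lo; i < hi; i++) sum += v[i];
        partial[p] = sum;
    }

    /* serial combine in a fixed order gives the same answer no matter
     * how the partitions were scheduled across threads */
    double total = 0.0;
    for (int p = 0; p < NPARTITIONS; p++) total += partial[p];
    return total;
}
```

Because the partition boundaries depend only on the iteration count, the serial combine sees the same partial sums in the same order however many threads ran the parallel loop.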
For convenience, the romp_for*.h files make it easy to define a few such temporaries. If you need more temporaries, they can be similarly defined and processed around the use of these files.
Which loops to make parallel
The ROMP support includes timing code, which produces a report on program exit showing which loops executed, how much time they used, and which of the above categories they were in.
This report has enabled us to identify all the important loops executed during recon-all, and enough of them have been modified to satisfy at least the level 3 criteria that there is no good reason to execute the lower-level loops in parallel.
Parallelizing apparently serial algorithms
Some of our algorithms work across the whole surface or volume, one vertex or one face at a time. Because each vertex's behavior is influenced by the behavior of all the others, these algorithms are inherently serial, and need to be replaced by similar parallel algorithms.
mrisAsynchronousTimeStep_optionalDxDyDzUpdate is an excellent example of this.
In other cases, it was necessary to split a serial loop into several loops, only some of which could be parallelized, as sketched below.
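The details vary from loop to loop; the following is a minimal sketch of the pattern, assuming a hypothetical updateAll whose per-vertex computation is independent but whose accumulation is order-dependent:

```c
/* Sketch only, all names hypothetical. The original loop both computed
 * a per-vertex value (expensive, but independent) and folded it into a
 * running global state (cheap, but order-dependent). Splitting gives
 * one parallelizable loop and one short serial loop. */
void updateAll(float* value, const float* input, int n,
               float (*expensiveCompute)(float),
               void  (*serialFold)(float)) {
    /* phase 1: independent per-vertex work, safe to parallelize */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        value[i] = expensiveCompute(input[i]);

    /* phase 2: order-dependent accumulation, kept serial */
    for (int i = 0; i < n; i++)
        serialFold(value[i]);
}
```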
Load balancing
In several places the original code looked approximately like:
- loop
  - if this iteration has something to do
    - do it
If the expensive iterations were clustered, this resulted in some of the chunks of iterations given to the parallel threads finishing much faster than the others, and dynamic scheduling did not significantly improve this.
To get good load balancing, it was effective to recode this as two loops:
- loop
  - if this iteration has something to do
    - append it to a list

then

- for each item in the list
  - do it
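A minimal sketch of this recoding, assuming hypothetical hasWork and doWork callbacks:

```c
/* Sketch only, all names hypothetical. Pass 1 serially collects the
 * iterations that have work to do; pass 2 parallelizes over that dense
 * list, so the expensive items are spread evenly across the threads
 * even if they were clustered in the original index space. */
#include <stdlib.h>

void processActive(int n, int (*hasWork)(int), void (*doWork)(int)) {
    int* list  = (int*)malloc(n * sizeof(int));
    int  count = 0;

    /* pass 1: cheap serial scan builds the work list */
    for (int i = 0; i < n; i++)
        if (hasWork(i)) list[count++] = i;

    /* pass 2: the expensive work is now evenly distributable */
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < count; j++)
        doWork(list[j]);

    free(list);
}
```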