== THIS PAGE HAS BEEN MOVED TO SHAREPOINT! ==
Please refer to this site and make edits there for the most up-to-date information: https://partnershealthcare.sharepoint.com/sites/LCN/SitePages/Morpho-Optimization-Project.aspx
----
Our algorithms deal with a 3D volume or surface. I am going to concentrate on the 3D mesh of vertices, faces, and edges for this discussion.
Consider the problem of finding a maximum in a list of values. Serial code scanning from one end will always find either the first or the last of several equal maxima, and so gets the same answer every time. Parallel code may find any of the maxima, possibly a different one on each run.
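As a minimal illustration (not FreeSurfer code), the C/OpenMP sketch below contrasts the two: when several elements share the maximum value, the serial scan always reports the first one, while the parallel version's answer depends on which thread reaches the combining step first.
{{{
// Illustrative sketch only: find the index of a maximum value.
int serial_argmax(const float *v, int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (v[i] > v[best]) best = i;   // strict '>' always keeps the first maximum
    return best;
}

int parallel_argmax(const float *v, int n) {
    int best = 0;
    #pragma omp parallel
    {
        int localBest = 0;
        #pragma omp for nowait
        for (int i = 1; i < n; i++)
            if (v[i] > v[localBest]) localBest = i;
        #pragma omp critical
        {
            // With equal maxima, whichever thread arrives here first wins,
            // so the reported index can change from run to run.
            if (v[localBest] > v[best]) best = localBest;
        }
    }
    return best;
}
}}}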
A similar problem is calculating an average. Floating point addition is not associative: (A+B)+C might differ slightly from A+(B+C). In serial code, the exact order of evaluation may differ depending on compiler optimizations. In parallel code it may differ depending on the number of threads and on how the OpenMP implementation assigns loop iterations to threads.
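Both effects can be seen in a tiny self-contained example, assuming nothing beyond standard C and OpenMP:
{{{
#include <stdio.h>

int main(void) {
    // Floating point addition is not associative.
    float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("(a+b)+c = %g\n", (a + b) + c);   // prints 1
    printf("a+(b+c) = %g\n", a + (b + c));   // prints 0: c is absorbed into b

    // The grouping of partial sums in a reduction, and hence the low-order
    // bits of the result, can vary with the number of threads and schedule.
    double x[1000], sum = 0.0;
    for (int i = 0; i < 1000; i++) x[i] = 1.0 / (i + 1);
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000; i++) sum += x[i];
    printf("sum = %.17g\n", sum);
    return 0;
}
}}}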
A surface folding algorithm calculates a value for each point on a surface using floating point operations, and the fold point is chosen from among several almost identical values, so very small differences in those values can change the answer.
== Reproducible OMP programming (ROMP) ==
With this in mind, a loop can be categorized as one that
1. Is being experimented on to see if it supports parallel execution
1. Is known to get slightly different, but acceptable, answers
1. Is thought to get exactly the same answer regardless of the number of threads
1. Is verified, by eye or by checking code, to get exactly the same answer
romp_support.h allows every parallel loop to be annotated with its category, and the loops in the earlier (less reproducible) categories can then be prevented from executing in parallel when reproducibility is needed.
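The sketch below shows the underlying idea using OpenMP's if() clause; the enum and variable names are invented for illustration and are not the actual macros defined in romp_support.h.
{{{
// Illustrative only; the real annotations live in romp_support.h.
typedef enum {
    LEVEL_EXPERIMENTAL = 1,    // 1. being experimented on
    LEVEL_ACCEPTABLE_DIFFS,    // 2. slightly different but acceptable answers
    LEVEL_BELIEVED_SAME,       // 3. thought to be exactly reproducible
    LEVEL_VERIFIED_SAME        // 4. verified to be exactly reproducible
} LoopLevel;

static LoopLevel requiredLevel = LEVEL_EXPERIMENTAL;  // raised when reproducibility is needed

void scale(float *v, int n, float f) {
    // OpenMP's if() clause makes the loop run serially when the condition
    // is false, so under-qualified loops fall back to serial execution.
    #pragma omp parallel for if (LEVEL_BELIEVED_SAME >= requiredLevel)
    for (int i = 0; i < n; i++)
        v[i] *= f;
}
}}}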
=== Reductions and similar sums ===
Reductions by summation of floating point values may produce different results from one execution of the parallel loop to another, especially if dynamic scheduling is used or if a different number of threads is used.
To avoid this, the '''romp_for*.h''' files support decomposing the loop into the same chunks regardless of the number of threads, and summing the reductions in a stable order. The use of .h files rather than macros leads to a better debugging experience. The technique is as follows:
1. determine the number of iterations needed
1. deterministically partition them into a few hundred partitions (the '''ROMP_Distributor''' struct and '''ROMP_Distributor_begin''' do this)
1. create temporaries for each partition that can be used for reducing the elements just of that partition
1. use an ''omp for loop'' to iterate over these partitions, doing any sums into the temporaries of that partition
1. do a serial loop combining the temporaries into the final destination
For convenience the '''romp_for*.h''' files make it easy to define a few such temporaries. If more temporaries are needed, they can be defined and processed in the same way around the use of these files.
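The sketch below shows the same technique written directly in OpenMP rather than via the '''romp_for*.h''' files; the partition count and the function name are illustrative.
{{{
#define NPARTITIONS 256   // illustrative; chosen independently of the thread count

double stable_sum(const double *x, int n) {
    double partial[NPARTITIONS] = {0.0};            // one temporary per partition

    // The partition boundaries depend only on n, never on the thread count.
    #pragma omp parallel for schedule(static)
    for (int p = 0; p < NPARTITIONS; p++) {
        int lo = (int)((long long)n *  p      / NPARTITIONS);
        int hi = (int)((long long)n * (p + 1) / NPARTITIONS);
        double s = 0.0;
        for (int i = lo; i < hi; i++) s += x[i];    // serial sum within the partition
        partial[p] = s;
    }

    // Serial, fixed-order combination of the per-partition temporaries,
    // so the final sum is the same regardless of how many threads ran.
    double sum = 0.0;
    for (int p = 0; p < NPARTITIONS; p++) sum += partial[p];
    return sum;
}
}}}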
== Which loops to make parallel ==
The ROMP support includes timing code, which produces a report on program exit showing which loops executed, how much time they used, and which of the above categories they were in.
This report has enabled us to identify all the important loops executed during recon-all, and enough of them have been modified to satisfy at least the level 3 criteria that there is no good reason to execute the loops in the lower categories in parallel.
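A much simplified sketch of how such per-loop timing can be gathered and reported at exit (this is not the romp_support.h implementation, just an illustration of the mechanism):
{{{
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

typedef struct { const char *name; int level; double seconds; long entries; } LoopStats;

#define MAX_LOOPS 128
static LoopStats loops[MAX_LOOPS];
static int nLoops;

static void reportAtExit(void) {
    for (int i = 0; i < nLoops; i++)
        printf("%-40s level %d %10.3fs %ld executions\n",
               loops[i].name, loops[i].level, loops[i].seconds, loops[i].entries);
}

int registerLoop(const char *name, int level) {   // called once per loop site, from serial code
    if (nLoops == 0) atexit(reportAtExit);
    loops[nLoops] = (LoopStats){name, level, 0.0, 0};
    return nLoops++;
}

void recordLoop(int id, double startTime) {       // called when the loop finishes
    loops[id].seconds += omp_get_wtime() - startTime;
    loops[id].entries++;
}
}}}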
=== Parallelizing apparently serial algorithms ===
Some of our algorithms work across the whole surface or volume, one vertex or one face at a time. Because each vertex's behavior is influenced by the behavior of the rest, these algorithms are inherently serial and need to be replaced by a similar parallel algorithm.
[[mrisAsynchronousTimeStep_optionalDxDyDzUpdate]] is an excellent example of this.
In other cases, it was necessary to split a serial loop into several loops, only some of which could be parallelized.
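The sketch below shows the shape of such a split; computeVertexValue() and applyInOrder() are hypothetical stand-ins for the independent and the order-dependent parts of the original loop body.
{{{
#include <stdlib.h>

extern double computeVertexValue(int vno);          // expensive, independent per vertex
extern void   applyInOrder(int vno, double value);  // depends on earlier vertices

void processSurface(int nvertices) {
    double *values = (double *)malloc(nvertices * sizeof(double));

    // Phase 1: the expensive, independent work can run in parallel.
    #pragma omp parallel for schedule(dynamic)
    for (int vno = 0; vno < nvertices; vno++)
        values[vno] = computeVertexValue(vno);

    // Phase 2: the order-dependent part stays serial, but is now cheap.
    for (int vno = 0; vno < nvertices; vno++)
        applyInOrder(vno, values[vno]);

    free(values);
}
}}}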
=== Load balancing ===
In several places the original code looked approximately like
{{{
loop
    if this iteration has something to do
        do it
    end if
end loop
}}}
If the expensive iterations were clustered, some of the chunks of iterations given to the parallel threads finished much faster than the others, and dynamic scheduling did not significantly improve this.
To get good load balancing it was effective to recode this as two loops
{{{
loop
    if this iteration has something to do
        append it to a list
    end if
end loop
}}}
then
{{{
for each item in the list loop
    do it
end loop
}}}
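A sketch of the two-loop recoding in C with OpenMP; hasWork() and doWork() are hypothetical placeholders for the cheap test and the expensive body.
{{{
#include <stdlib.h>

extern int  hasWork(int i);    // cheap test
extern void doWork(int i);     // expensive body

void processAll(int n) {
    // First loop: a cheap serial pass that just collects the iterations to do.
    int *todo = (int *)malloc(n * sizeof(int));
    int  count = 0;
    for (int i = 0; i < n; i++)
        if (hasWork(i)) todo[count++] = i;

    // Second loop: every iteration is expensive, so the work divides evenly
    // across the threads even when the busy iterations were clustered.
    #pragma omp parallel for schedule(dynamic)
    for (int k = 0; k < count; k++)
        doWork(todo[k]);

    free(todo);
}
}}}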