
Parent: MorphoOptimizationProject

== Causes and Cures ==

There are four main causes of parallelism not resulting in a speedup:

  1. Insufficient work per thread
  2. Excessive locking
  3. Excessive memory traffic
  4. Work not spread equally between the threads, aka load imbalance

=== Insufficient work per thread ===

OpenMP starts its worker threads only when it needs them, but it keeps them alive afterwards and can hand them new work fairly quickly. Nevertheless, dispatching work to a thread costs thousands of instructions, so there must be thousands of instructions' worth of work to give it. Because OpenMP assigns more than one iteration of a loop to a thread at a time, this means that the number of iterations divided by the number of threads, multiplied by the amount of work in one iteration, needs to be thousands of instructions. This is usually the case - unless you use {{{schedule(static,1)}}}, which hands each thread one iteration at a time. Don't do that unless each iteration has a lot of work to do! In general, let OpenMP decide how to schedule the work unless it results in an imbalance.
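
To make the tradeoff concrete, here is a minimal sketch - a toy loop, not FreeSurfer code - contrasting the default schedule with {{{schedule(static,1)}}}:
{{{
#include <stdio.h>

#define N 100000

static double a[N];

int main(void) {
    // The usual default (static) schedule gives each thread one large
    // contiguous chunk, so the thousands of instructions it costs to
    // dispatch work to a thread are amortized over many iterations.
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = i * 0.5;

    // schedule(static,1) deals iterations out one at a time, round-robin.
    // Adjacent elements of a[] then belong to different threads, so the
    // threads fight over the same 64-byte cache lines. Only use this when
    // a single iteration is itself a large amount of work.
    #pragma omp parallel for schedule(static,1)
    for (int i = 0; i < N; i++)
        a[i] += 1.0;

    printf("a[N-1] = %f\n", a[N-1]);
    return 0;
}
}}}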

=== Excessive locking ===

Locking and unlocking take only a few instructions - unless another thread is competing for the same lock. Bouncing a lock between threads is expensive. There are several ways to reduce the cost:

  1. {{{#pragma omp critical}}} should only be used if it locks less than 1% of the work, and you are sure that it is not nested.

  2. {{{omp_lock_t}}} operations are safer, but you may need to lock the initialization of the lock itself! Partition complex data structures and have a lock per partition to reduce contention. This works well for caches where multiple threads provide cached results for each other.

  3. Create per-thread data structures to hold each thread's contribution and, if necessary, merge the results after the parallel loop has exited. It is often possible to parallelize the merge, as sketched below.
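
Here is a minimal sketch of the third approach, using a hypothetical per-thread histogram rather than any actual FreeSurfer data structure:
{{{
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NBINS 64
#define N     1000000

int main(void) {
    int nthreads = omp_get_max_threads();
    int global[NBINS] = {0};

    // One private histogram per thread, so the hot loop needs no locks.
    // (Padding each thread's slab to a multiple of 64 bytes would also
    // avoid false sharing at the slab boundaries.)
    int *local = calloc((size_t)nthreads * NBINS, sizeof *local);
    if (!local) return 1;

    #pragma omp parallel
    {
        int *mine = local + (size_t)omp_get_thread_num() * NBINS;

        #pragma omp for
        for (int i = 0; i < N; i++)
            mine[i % NBINS]++;               // stand-in for real work
    }

    // Merge the per-thread contributions after the parallel loop has
    // exited; for a large NBINS this merge could itself be parallelized
    // over the bins.
    for (int t = 0; t < nthreads; t++)
        for (int b = 0; b < NBINS; b++)
            global[b] += local[t * NBINS + b];

    free(local);
    printf("bin 0 holds %d items\n", global[0]);
    return 0;
}
}}}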

=== Excessive memory traffic ===

This is the biggest, and least understood, performance killer. If data written by one thread is sitting in that thread's 256 KB L2 cache, it can take the equivalent of twenty or more instructions to move the 64-byte cache line containing it to another thread.

For example, if a 128 KB float array is written in one loop and read in the following loop, it may take at least three or four operations per element just to cover the cost of moving the data. There is a lot of system-specific variation in these numbers.
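
As a toy illustration of the cure - not FreeSurfer code - fusing the producer and consumer loops keeps each element in the cache of the thread that computed it:
{{{
#include <stdio.h>

#define N 32768                      /* 32768 floats = 128 KB */

static float a[N];

int main(void) {
    // Written serially: every cache line of a[] ends up in one core's L2.
    for (int i = 0; i < N; i++)
        a[i] = i * 0.001f;

    // Read in a later parallel loop: the other threads must pull those
    // 64-byte lines across the chip, and with only one multiply and one
    // add per element there is little work to hide that cost behind.
    float sum = 0.0f;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * a[i];

    // Fused version: each thread writes and immediately reads its own
    // elements, so the data never leaves its cache.
    float sum2 = 0.0f;
    #pragma omp parallel for reduction(+:sum2)
    for (int i = 0; i < N; i++) {
        float v = i * 0.001f;
        a[i] = v;
        sum2 += v * v;
    }

    printf("%f %f\n", sum, sum2);
    return 0;
}
}}}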

If most of the data is coming out of the L3 cache or DRAM, then that can easily become the bottleneck. This is also a problem for serial code, so the issue is described in MorphoOptimizationProject_ReducingMemoryTraffic.

=== Load imbalance ===

If the work is not evenly spread across the iterations, but is instead concentrated in a few regions of the iteration space, then the threads that are not assigned those regions must wait for the ones that are.

This is quite possible given the spatial locality in our data.
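
One common cure is dynamic scheduling. Below is a minimal sketch with a hypothetical {{{work_for()}}} whose cost is concentrated at the front of the iteration space:
{{{
#include <stdio.h>

#define N 10000

// Hypothetical per-iteration work whose cost is concentrated in a few
// regions, mimicking spatial locality in the data.
static double work_for(int i) {
    int reps = (i < 500) ? 100000 : 10;   // the hot region is at the front
    double x = 0.0;
    for (int r = 0; r < reps; r++)
        x += (double)i / (double)(r + 1);
    return x;
}

int main(void) {
    double total = 0.0;

    // With the default (usually static) schedule the first thread owns
    // almost all the expensive iterations and the others wait for it.
    // schedule(dynamic,16) hands out 16-iteration chunks on demand, so
    // threads that finish early keep taking work.
    #pragma omp parallel for schedule(dynamic,16) reduction(+:total)
    for (int i = 0; i < N; i++)
        total += work_for(i);

    printf("total = %f\n", total);
    return 0;
}
}}}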

== Diagnosis ==

There is one easy way to see whether more parallelism could speed up your program: if two copies of the program can run simultaneously in about the same time that one copy takes alone, then there is potential for additional parallelism to speed up one copy.

There are three levels of tools to diagnose why this potential exists.

  1. Timing code built into the application.
  2. {{{oprofile}}} et al.
  3. Intel's VTune product.

Of these, VTune is by far the easiest to use. Writing and interpreting your own timing code, or using {{{oprofile}}}, is significantly more difficult and cannot achieve the same level of visibility into the execution.

== Using VTune to improve concurrency ==

The following assumes Intel Parallel Studio is installed and you have executed the {{{psxevars.sh}}} script.

For {{{mris_fix_topology}}} the normal testing commands are
{{{
cd mris_fix_topology ; rm -rf testdata; tar xzf testdata.tar.gz
cd testdata
cp ./subjects/bert/surf/lh.orig{.before,}
export SUBJECTS_DIR=`pwd`/subjects
../mris_fix_topology -niters 2 -mgz -sphere qsphere.nofix -ga -seed 1234 bert lh
}}}

The last line is the command whose behaviour I will study.

The first step is to build the VTune Amplifier command line that will collect the data. Start the GUI...

{{{
amplxe-gui &
}}}

Create a project. I named mine {{{study_mris_fix_topology}}}.

=== Specifying the Launch Application ===

Fill in the '''Application''' field by browsing to the executable. Mine was {{{/home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/mris_fix_topology}}}

Copy the rest of the '''Application parameters''' from above: {{{-niters 2 -mgz -sphere qsphere.nofix -ga -seed 1234 bert lh}}}

Specify the '''Working directory'''. For me it was {{{/home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/testdata}}}

In the Advanced section, I increased the Result size from 500 MB to 5000 MB. I did not specify any user-defined environment variables, because we will do this ourselves below.

=== Specify How to analyse the execution ===

You may see an error message in this panel telling you to set {{{/proc/sys/kernel/yama/ptrace_scope}}} to 0. You must do this, for instance by...

{{{
sudo -i
echo 0 > /proc/sys/kernel/yama/ptrace_scope
exit
}}}

The retry button then clears this error.

The default ''How'' is '''Basic Hotspots''' without analyzing OpenMP regions, but we really need to understand the OpenMP behaviour, so check that box. In the Details, you can see that it is sampling every 10ms and collecting call stacks. All the defaults here are good enough.

=== Copy the command line ===

There is a button near the bottom that will show you the command line, which can then be copied to the clipboard.

My command line is shown below.

=== Run the collection ===

Because we have so much other stuff to set up, I don't use the GUI to start the collection. Instead I go to a command-line terminal, do the setup as before, and then execute the command line:

{{{
cd /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/testdata
cp ./subjects/bert/surf/lh.orig{.before,}
export SUBJECTS_DIR=`pwd`/subjects
export OMP_NUM_THREADS=4
/opt/intel/vtune_amplifier_2019.0.0.553900/bin64/amplxe-cl -collect hotspots -knob analyze-openmp=true -app-working-dir /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/testdata -- /home/rheticus/freesurfer_repo_nf_faster_distance_map_update_6/freesurfer/mris_fix_topology/mris_fix_topology -niters 2 -mgz -sphere qsphere.nofix -ga -seed 1234 bert lh
}}}

=== Look at the result ===

There is a really simple summary printed

{{{
Summary
-------
Elapsed Time:             31.787
Paused Time:              0.0
CPU Time:                 66.530
Average CPU Utilization:  2.095
CPI Rate:                 1.014
}}}

which tells us that we are only effectively using 2.095 of the available 4 cores (Average CPU Utilization is CPU Time divided by Elapsed Time: 66.530 / 31.787 ≈ 2.09). However, to really see what is happening we need to run the GUI.

{{{
amplxe-gui r000hs/r000hs.amplxe &
}}}

The {{{r000hs}}} will have a new number for each run, and the {{{hs}}} stands for ''hotspot''. We will see different numbers and abbreviations in later analyses.

The Project Navigator on the LHS is worthless. Click its X button to close it.

The '''Top Hotspots''' shows us
{{{
Function                                            Module             CPU Time
func@0x11ad0                                        libgomp.so.1       13.6
mrisComputeDefectMRILogUnlikelihood_wkr._omp_fn.47  mris_fix_topology   9.5
}}}

This immediately shows the application is spending a lot of CPU time in the gomp stuff - locking or waiting!
