There are four main reasons why parallelism may not result in speedup:
- Insufficient work per thread
- Excessive locking
- Excessive memory traffic
- Work not spread equally between the threads, aka load imbalance
Insufficient work per thread
OpenMP starts threads when it first needs them, then keeps them idle between parallel regions so it can hand them new work fairly quickly. Nevertheless, dispatching work to a thread costs on the order of thousands of instructions, so each thread must be given thousands of instructions' worth of work to amortize that cost. Because OpenMP assigns a chunk of loop iterations to a thread at once, the number of iterations divided by the number of threads, multiplied by the amount of work in one iteration, needs to amount to thousands of instructions. This is usually the case - unless you use schedule(static,1), which hands out one iteration at a time; don't do that unless each iteration has a lot of work to do! In general, let OpenMP decide how to schedule the work unless the default results in an imbalance.
Excessive locking
Locking and unlocking takes only a few instructions - unless another thread is also banging away on the same lock. Bouncing a lock between threads is expensive. It is easy to use #pragma omp critical, but only do so if the critical section accounts for less than about 1% of the work. If contention is a concern, use explicit omp_lock_t operations with finer-grained locks so that threads rarely contend for the same one.
Alternatively, create per-thread data structures to hold each thread's contribution and merge the results after the parallel loop has exited.
Excessive memory traffic
This is the biggest, and least understood, performance killer. If data was written by one thread and sits in that core's 256KB L2 cache, it can take the equivalent of twenty or more instructions to move the 64-byte cache line containing it to another thread.
For example, if a 128KB float array is written in one loop and read in the following loop, each element may need at least three or four operations of real work to cover the cost of moving the data between threads. There is a lot of system-specific variation in these numbers.
If most of the data is coming out of the L3 cache or the DRAM, then that can become the bottleneck very easily.
Load imbalance
If the work is not evenly spread across the iterations, but is instead concentrated in a few regions of the iteration space, then the threads that are not assigned those regions must wait for the ones that are.
This is quite possible given the spatial locality in our data.