There is a strong correspondence between the portions of Shor's algorithm that would run on a classical computer and the portions of the simulation that do not parallelize, and likewise between the portions that would run on a quantum computer and the portions that do.
The pre- and post-processing steps of Shor's algorithm, which take place on a classical computer, are poor candidates for parallelization. In contrast, the portions that the quantum computer would perform are easily parallelized.
The basis for parallelization in the simulation of Shor's algorithm is that, at several points in the sequential code, for loops iterate over large arrays and modify each element in some uniform manner.
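As a minimal sketch of the kind of loop the text describes (the function name and the choice of operation are illustrative, not taken from the actual code), consider scaling every amplitude in a state vector by a constant: each iteration touches only its own element, so the iterations are independent and the index range can be split among workers.

```c
#include <complex.h>
#include <stddef.h>

/* Uniform per-element loop: every iteration reads and writes only
 * state[i], so disjoint index ranges can be processed in parallel
 * with no locking. */
void scale_amplitudes(double complex *state, size_t len, double factor)
{
    for (size_t i = 0; i < len; i++)
        state[i] *= factor;
}
```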
Given this type of parallelism, each of the Charm++, pthreads, and MPI paradigms has its appeals and disadvantages. Charm++'s object paradigm coincides with the sequential code's object-oriented approach. Charm++'s load balancing features, however, are of little utility: the workload across even array portions is already very even, and it is not clear that the overhead Charm++ imposes would be recovered by superior load balancing. MPI seems reasonable, since in the parallelized code each process element iterates over its own exclusive region; however, each portion must then be communicated to a manager process, which performs various operations.
Pthreads seemed the most natural choice: since each thread iterates over its own unique set of array locations, there is no need for locks and no danger of deadlock or data corruption. If the per-thread array portions are suitably large, there is very little false sharing from different threads' portions occupying the same cache lines. Having settled on pthreads, two hurdles remain: we must choose a synchronization method, and we must decide how to split the work.
Splitting the work is handled in range.h: given the values of n, q, and num pthreads, it fills each of num pthreads array slots so that the i'th entries of q range lower, q range upper, n range lower, and n range upper hold the array locations where the i'th process element should begin and end processing.
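The splitting itself can be sketched as follows. This is an illustrative stand-in for what range.h computes, not its actual interface: it divides len array slots as evenly as possible among nthreads workers, writing inclusive bounds into hypothetical lower and upper arrays (assuming len >= nthreads, so every worker gets at least one slot).

```c
#include <stddef.h>

/* Divide len slots among nthreads workers as evenly as possible.
 * Thread i processes the inclusive range lower[i]..upper[i]; the
 * first (len % nthreads) threads each take one extra slot. */
void split_range(size_t len, size_t nthreads, size_t *lower, size_t *upper)
{
    size_t base  = len / nthreads;  /* minimum slots per thread        */
    size_t extra = len % nthreads;  /* leftover slots to spread around */
    size_t start = 0;
    for (size_t i = 0; i < nthreads; i++) {
        size_t count = base + (i < extra ? 1 : 0);
        lower[i] = start;
        upper[i] = start + count - 1;
        start += count;
    }
}
```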
Synchronization is handled in barrier.h, where we implement a barrier with pthread locks. These barriers are placed before and after each parallel section. They are necessary because, just before or after these parallelized sections, we frequently perform an operation that depends on a result computed over the entire array.
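A standard counting barrier built from a pthread mutex and condition variable, in the spirit of barrier.h, looks like the following (this is a generic sketch, not the project's actual implementation). The generation counter lets the barrier be reused across the repeated parallel sections without a fast thread from the next phase racing past slower threads still leaving the previous one.

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  all_here;
    int nthreads;    /* threads that must arrive            */
    int waiting;     /* threads arrived so far this round   */
    int generation;  /* which use of the barrier this is    */
} barrier_t;

void barrier_init(barrier_t *b, int nthreads)
{
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->all_here, NULL);
    b->nthreads   = nthreads;
    b->waiting    = 0;
    b->generation = 0;
}

void barrier_wait(barrier_t *b)
{
    pthread_mutex_lock(&b->lock);
    int gen = b->generation;
    if (++b->waiting == b->nthreads) {
        /* Last arrival: reset for reuse and release everyone. */
        b->waiting = 0;
        b->generation++;
        pthread_cond_broadcast(&b->all_here);
    } else {
        /* Wait until the last thread advances the generation;
         * the loop guards against spurious wakeups. */
        while (gen == b->generation)
            pthread_cond_wait(&b->all_here, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}
```

POSIX also provides pthread_barrier_t directly, but a hand-rolled version like this works on platforms where that optional interface is unavailable.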
In accordance with Amdahl's law, our speedup is limited by that of the parallelized sections, but those sections account for such a large portion of the total running time that the overall speedup is nearly linear.
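The limit being appealed to can be stated explicitly: if a fraction $p$ of the sequential running time parallelizes perfectly across $N$ threads, Amdahl's law gives the overall speedup

$$S(N) = \frac{1}{(1 - p) + p/N},$$

so as $p \to 1$ the speedup approaches the linear bound $N$, matching the nearly linear behavior observed here.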