How to Use QuickMP


This page includes basic quick-start instructions, miscellaneous tips, and some examples.  Everything here refers to the latest release version.


Basic Usage

Step 1: Include quickmp.h in your code.


#include <quickmp/quickmp.h>


Step 2: Link with standard platform-specific threading libraries.


UNIX – link with the pthread library (i.e., the linker flag -lpthread).  Also pass pthread flags to the compiler and linker (e.g., -pthread for gcc).


Windows – most likely nothing to do.  You must link with a multithreaded version of the C runtime libraries (in Visual Studio this is under C/C++ -> Code Generation -> Runtime Library in the project properties); however, VC++ 2005 and later removed the single-threaded versions, so this is usually not an issue.
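For concreteness, a typical UNIX build command might look like the following (the file names and include path here are hypothetical; adjust them for your project):

```shell
# Compile and link a QuickMP program with g++ on UNIX.
# -pthread enables pthread support in both the compile and link steps;
# point -I at wherever the quickmp headers live on your system.
g++ -pthread -I/path/to/quickmp/include main.cpp -o myprogram
```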


Step 3: Parallelize your for loop.


#include <quickmp/quickmp.h>

int main(int argc, char* argv[])
{
    // Same as for (int i = 0; i < 1000000; ++i)...
    QMP_PARALLEL_FOR(i, 0, 1000000)
        // ...loop body goes here...
    QMP_END_PARALLEL_FOR

    return 0;
}


General Info

o       By default a parallel for loop will use one thread per processor (determined at runtime).  Each thread will be assigned a subset of the loop iterations.

o       QuickMP is intended for shared memory programming, i.e. multithreaded programs/libraries running on a single machine with multiple processors all sharing the same memory.

o       QuickMP is designed for data parallelism, not task parallelism.  Thus, it is ideal for applications where the same operations are applied repeatedly on lots of data.

o       QuickMP follows a fork-join model.  The idea is that the main thread encounters a distinct parallel section, forks into a bunch of worker threads which split up the work, and then joins back into a single thread again.  (Internally there is a thread pool which waits in the background until needed.)

o       API macros are provided for general thread management (setting number of threads, getting thread indices, etc.), declaring shared variables, data synchronization (barriers and critical sections), scheduling hints for load balancing, and more.  See the API reference for more details.


The Rules

o       Like OpenMP, QuickMP requires the loop iterations to be order-independent.  One iteration cannot depend on the results from another iteration.  In other words, your code should give the same result even if the iterations run backwards or in a random order.

o       The loop index always begins at a given starting index and counts up.

o       The code inside of the loop is special; it has a different scope from the surrounding code.  It cannot directly access outside variables without using the shared data macros (described below).  It can, however, call any function and create local variables that would be allowed within a normal for loop in the same scope.

o       Iteration over STL containers is possible for those with random access iterators.  Elements are accessed using the loop index, either with the [] operator or iterator arithmetic (e.g., container.begin() + i).

o       Creating nested parallel for loops results in undefined behavior.

o       Creating multiple simultaneous parallel for loops (e.g., two different threads each start their own loops) results in undefined behavior.


Shared Data

o       There is a 2-step process for using shared variables within a loop: declare them shared before the loop, then pull them into the loop's scope inside it:


int sharedValue = 8;
std::vector<double> sharedData(100);
QMP_SHARE(sharedValue);
QMP_SHARE(sharedData);

QMP_PARALLEL_FOR(i, 0, 1000000)
    QMP_USE_SHARED(sharedValue, int);
    QMP_USE_SHARED(sharedData, std::vector<double>);
    processMyData(sharedData, i, sharedValue);
QMP_END_PARALLEL_FOR


o       Shared variables that are modified must be protected with a critical section:


int sum = 0;
QMP_SHARE(sum);

QMP_PARALLEL_FOR(i, 0, 1000000)
    QMP_USE_SHARED(sum, int);
    QMP_CRITICAL();
    sum += i;
    QMP_END_CRITICAL();
QMP_END_PARALLEL_FOR



Performance Considerations

o       Parallel for loops are most effective at a very high level in your program, not in the inner loops.

o       In practice runtime performance will be proportional to the number of available processors but will probably not increase linearly with more processors.  (See Amdahl's Law.)

o       QuickMP adds minimal runtime overhead for thread management.  In the single-threaded case the overhead is negligible.

o       Threads are created and destroyed only when the program starts and exits and when the number of threads is explicitly changed.

o       The parallel for loop macro provides an optional “schedule hint” argument to specify how the iterations should be distributed.  “Sequential” gives each thread equal chunks of the loop range and is preferred if each iteration takes the same amount of time.  “Interleaved” gives the threads alternating loop iterations and is preferred if the iterations could take different amounts of time.  The following example shows how to use this option:



int imageWidth = 1600;
int imageHeight = 1200;
int numPixels = imageWidth * imageHeight;
std::vector<Object3D*> sceneObjects; // ...filled with scene geometry...
std::vector<Color> image(numPixels);
QMP_SHARE(imageWidth);
QMP_SHARE(imageHeight);
QMP_SHARE(sceneObjects);
QMP_SHARE(image);

QMP_PARALLEL_FOR(pixelIndex, 0, numPixels, quickmp::INTERLEAVED)
    QMP_USE_SHARED(imageWidth, int);
    QMP_USE_SHARED(imageHeight, int);
    QMP_USE_SHARED(sceneObjects, std::vector<Object3D*>);
    QMP_USE_SHARED(image, std::vector<Color>);
    Color c = fireRay(sceneObjects, imageWidth, imageHeight, pixelIndex);
    image[pixelIndex] = c;
QMP_END_PARALLEL_FOR


o       Keep critical sections short to reduce thread idleness.  If possible, avoid critical sections altogether.

o       If possible, avoid sharing data among threads to improve cache performance.

o       Remember that calling global functions within threads can involve shared memory (e.g., rand() uses shared global state variables).

o       Parallel reads from the same memory location should cause little slowdown, but if that memory changes (e.g., when one thread writes to it), all threads using that memory must update their cached copies.

o       CPU architecture makes a big difference, especially when data is shared among threads.  Newer multicore chips are designed for better multithreaded performance (shared memory caches, faster interconnections).  (An interesting example is a program with lots of shared data running on a machine with two processor dies, each containing two cores.  This program will probably scale very well from 1 to 2 threads because of the fast communication between two cores on the same die.  However, it might not scale as well from 2 to 3 or 4 threads, depending on the communication speed between the two dies.)  Note that a shared cache can slow things down if two processors sharing the cache are both accessing totally different memory locations, and each are frequently accessing more memory than half of the cache size, causing constant cache misses for the other processor.

o       An informative example: a ray tracing benchmark (included with the distribution) executed on two different CPU architectures.  First, note the minimal overhead required for a 1-thread parallel loop over the control (a non-QuickMP for loop).  Notice the excellent scalability on the dual-core Pentium D going from 1 thread to 2 threads (about 1.95X normal speed); going to 3 and 4 threads adds little benefit because there are only 2 processors.  The dual-core Xeon only improves by about 20% going from 1 thread to 2, possibly because both threads are running on one hyper-threaded core, which is not quite as good as two separate cores.  Going from 2 to 3 threads adds a much larger performance increase (roughly 1.8X normal speed) because at least one thread is running on each core.


Miscellaneous Tips

o       The quickmp.h header file is rather long and includes other system headers.  If desired, it can be split into a .h and .cpp file for faster compilation.  (The ideal split point has been marked in the header file.)

o       Almost all C rand() implementations are not thread-safe because their internal state variables are not protected from being accessed by any thread at any time.  If you are concerned about repeatable results, consider using rand_r() (a reentrant version of rand() available on some platforms) or some other reentrant random number generator.