8 Simple Rules for Designing Threaded Applications
In a bottom-up approach, you start by attempting to thread the hotspots in your code. If this is not possible, you search up the call stack of the application to determine whether there is another point in the code that can be run in parallel and still execute the hotspots. For example, if you have a picture compression application, you can divide the processing of the picture into separate, independent regions to be processed in parallel. Even if it is possible to employ concurrency at the hotspot code, you should still look to see whether that concurrency could be implemented at a point higher up in the call stack. This can increase the granularity of the work done by each thread.
With the top-down approach, you first consider the whole application: what the computation is coded to accomplish and, at a very abstract level, all the parts of the application that combine to realize that computation. If there is no obvious concurrency, distill the computation into successively smaller parts until you can identify independent computations. Results from the Analysis phase can guide your investigation toward the most time-consuming modules. Consider threading a video encoding application. You could start at the lowest level, with independent pixels within a single frame, or recognize that groups of frames can be processed independently of other groups. If the application is expected to process multiple videos, expressing your parallelism at that level may be easier to implement and will be at the highest level of possible concurrency.
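To make this concrete, here is a minimal sketch of expressing the concurrency at the level of whole video streams; encode_video() is a hypothetical stand-in for a real serial encoder, and OpenMP is used only for brevity:

#include <omp.h>
#include <stdio.h>

/* Hypothetical serial encoder standing in for the real one. */
static void encode_video(const char *filename)
{
    printf("encoding %s on thread %d\n", filename, omp_get_thread_num());
}

/* One loop iteration = one whole video stream: coarse-grained work per
 * thread with no synchronization needed between iterations. */
void encode_batch(const char *files[], int num_files)
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < num_files; ++i)
        encode_video(files[i]);
}

Dropping the parallelism down to the pixel level would instead require a thread team inside encode_video() itself, with far less work done between synchronizations.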
The granularity of concurrent computations is loosely defined as the amount of computation done before synchronization is needed. The longer the time between synchronizations, the coarser the granularity. Fine-grained parallelism runs the danger of not assigning enough work to each thread to overcome the overhead of using threads. Adding more threads, when the amount of computation doesn't change, only exacerbates the problem. Coarse-grained parallelism has lower overhead costs and tends to be more readily scalable to an increase in the number of threads. Top-down approaches to threading (or driving the point of threading as high in the call stack as possible) are the best options for achieving a coarse-grained solution.
Rule 3. Plan early for scalability to take advantage of increasing numbers of cores.
Processors have recently gone from being dual-core to quad-core. Intel has announced the 80-core Teraflop chip. The number of cores available in future processors will only increase. Thus, you should plan for such processor increases within your software. Scalability is the measure of an application's ability to handle changes, typically increases, in system resources (number of cores, memory size, bus speed, etc.) or data set sizes. In the face of more cores being available, write flexible code that can take advantage of different numbers of cores.
To paraphrase C. Northcote Parkinson, "Data expands to fill the processing power available." This means that as computational power increases (more cores), it becomes more likely that the data to be processed will expand. For the video encoding example above, it is unlikely the size of frames will change. Thus, increasing the number of threads working within a single frame when the number of cores increases will lower the total number of pixels each thread processes. This change will only serve to increase the overhead of using more threads and create an even more fine-grained execution. More likely, the length or number of video streams to be encoded will increase. Adding more threads to handle longer (or additional) video streams will divide the increased workload at the point where it has actually increased. This makes the application very scalable.
Designing and implementing concurrency by data decomposition methods will be more scalable than functional decompositions. The number of independent functions is likely limited and fixed during the course of an application’s execution. Once each independent function has a thread and core to execute on, increasing the number of threads to take advantage of more cores will not increase performance of the application. Since data size tends to increase with the number of cores available, data decomposition designs will have the best chance for scalability.
Even though an application has been written with threads assigned to independent functions, there still may be possibilities for using extra threads when the input workload increases. Consider the house building example from above. There are a finite number of separate tasks to be done. However, if the size of the house to be built were doubled, you would expect that extra workers could be used within some of those tasks (more painters, more roofers, more plumbers). In other words, even within a functional decomposition, look for tasks whose workload grows with the size of the problem and plan to assign additional threads to them.
Rule 4. Make use of thread-safe libraries wherever possible.
If your hotspot computations can be executed through a library call, you should strongly consider using an equivalent library function instead of hand-written code. It's never a good idea to "reinvent the wheel" by writing code that performs calculations already encapsulated in optimized library routines. Many libraries, such as the Intel® Math Kernel Library (MKL) and Intel® Integrated Performance Primitives (IPP), have functions that are already threaded to take advantage of multi-core processors.
Even more important than using threaded library routines, though, is to ensure that any library calls used are thread-safe. That is, if library routines are called from two different threads, each call will return the correct results. If routines reference and update shared variables within the library, there may be data races that can adversely affect the reliability of the computations. To be thread-safe, a library routine can be written to be re-entrant (never updating anything except local variables) or can add synchronization to protect access to shared resources. Check the library documentation for the thread-safety of any third-party library you are using.
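As a small illustration of the re-entrancy option: the standard C function strtok() keeps hidden static state and is therefore not thread-safe, while its POSIX counterpart strtok_r() moves that state into a caller-supplied pointer. A minimal sketch (not tied to any particular vendor library):

#include <stdio.h>
#include <string.h>

/* Safe to call concurrently from many threads: all tokenizer state
 * lives in the caller's saveptr rather than in a hidden static buffer. */
void print_tokens(char *line)
{
    char *saveptr;
    for (char *tok = strtok_r(line, " ", &saveptr);
         tok != NULL;
         tok = strtok_r(NULL, " ", &saveptr))
        printf("%s\n", tok);
}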
Rule 5. Use the right threading model.
If threaded libraries are insufficient to cover all the concurrency of an application and you must employ user-controlled threads, don’t use (more complex) explicit threads if OpenMP has all the functionality you need. Explicit threads do allow for finer control of the threading implementation. However, if you are only parallelizing compute-intensive loops or otherwise don’t need the extra flexibility you can get with explicit threads, there’s probably no reason to do more work than necessary. The more complex the implementation, the easier it will be to make a mistake and the harder it will be to maintain such code later.
OpenMP is focused on data decomposition methods, especially targeted to threading of loops that range over large data sets. Even if this is the only type of parallelism that can be introduced into an application, there may be external requirements (engineering practices dictated by your employer, management preferences) that will prohibit your use of OpenMP and you will need to implement your threading with an explicit model. In such a situation, you could still use OpenMP to prototype the planned concurrency and better estimate the potential performance gains, possible scalability, and how much effort will be needed to thread the serial code with explicit threads.
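For instance, a compute-intensive loop can often be prototyped with a single OpenMP pragma. A minimal sketch, where the sin/cos work is just a placeholder for a real computation:

#include <math.h>
#include <omp.h>
#include <stdio.h>

#define N 10000000

static double data[N];

int main(void)
{
    double start = omp_get_wtime();

    #pragma omp parallel for   /* remove this line to recover the serial code */
    for (int i = 0; i < N; ++i)
        data[i] = sin((double)i) * cos((double)i);

    printf("elapsed: %f seconds\n", omp_get_wtime() - start);
    return 0;
}

Timing the loop with and without the pragma gives a quick estimate of the speedup an explicit-thread implementation would need to match.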
Rule 6. Never assume a particular order of execution.
With serial computations, it is easy to predict the statement that will be executed following any other statement in a program. On the other hand, thread execution order is non-deterministic and controlled by the OS scheduler. What this means is that there is no reliable way of predicting the order of threads running from one execution to another or even which thread will be scheduled to run next. This is done primarily to hide execution latency within an application, especially when run on a system with fewer cores than threads. If a thread blocks because it needs memory that is not located in cache or to process an I/O request, the scheduler will swap out the blocked thread and swap in a thread that is ready to run.
Data races are a direct result of this scheduling non-determinism. If you assume that one thread will write a value into a shared variable before another thread will read that value, you may be right all of the time or you may be right some of the time or you may be right none of the time. Sometimes, if you’re lucky, the order of thread execution remains unchanged on a specific platform each and every time you run an application. Each and every difference between systems (bit locations on the disk or memory speed or frequency of the AC power coming out of the wall sockets) has the potential to alter the thread schedule observed. Code that relies on a particular order of execution among threads, through nothing more than positive thinking, will be in trouble with things like data races and deadlock.
From a performance perspective, it would be best to allow threads to run as unencumbered as possible, like greyhounds or thoroughbreds in a race. Don’t try to enforce a particular order of execution unless absolutely necessary. You need to be able to recognize those times when it is absolutely necessary and implement some form of synchronization to coordinate the execution order of threads relative to each other. Consider two friends reading a newspaper. Looking over the pages spread out on a table, the two may read at different rates and will likely be interested in different articles. Whichever friend finishes reading the pages first must wait for the other to complete her perusal of the articles. There are no restrictions on the order of articles either reader chooses to read or the amount of time one can devote. Each reader proceeds at their own pace and then synchronizes with the other reader before going to the next set of pages.
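When one thread genuinely must wait on another, express that requirement with a synchronization object instead of trusting the scheduler. A minimal sketch using POSIX threads and a condition variable (the producer/consumer names and the value 42 are illustrative only):

#include <pthread.h>
#include <stdio.h>

static int result;
static int ready = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    result = 42;                 /* the value the consumer needs */
    ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!ready)               /* never assume the producer ran first */
        pthread_cond_wait(&cond, &lock);
    printf("got %d\n", result);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}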
Rule 7. Use thread-local storage whenever possible; associate locks to specific data, if needed.
Synchronization is overhead that does not contribute to the furtherance of the computation, except to guarantee the correct answers are produced from the parallel execution of an application. It is a necessary evil. Even so, you should actively seek to keep the amount of synchronization to a minimum. This can most readily be done through the use of storage local to threads or use of exclusive memory locations (e.g., an array element indexed by thread ID).
Temporary work variables rarely need to be shared between threads. These should be declared or allocated locally to each thread. Variables that hold partial results for each thread should also be local to threads. Combining the partial results into a shared location will require some synchronization. Ensuring that the shared updates are done as infrequently as possible will keep the amount of overhead to a minimum. If using explicit threads, there are Thread Local Storage APIs available to enable the persistence of data local to threads from one threaded region to another or from one threaded function call to the next execution of the same function.
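A minimal OpenMP sketch of this pattern, in which each thread accumulates into a private partial sum and updates the shared total exactly once:

#include <omp.h>

double sum_array(const double *a, int n)
{
    double total = 0.0;
    #pragma omp parallel
    {
        double local = 0.0;          /* thread-local partial result */
        #pragma omp for nowait
        for (int i = 0; i < n; ++i)
            local += a[i];

        #pragma omp critical         /* one shared update per thread */
        total += local;
    }
    return total;
}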
If local storage for each thread is not a valid option and you must coordinate access to shared resources through synchronization objects (like a lock), be sure to properly associate (or "attach") locks to data items. The easiest way to do this is to have a 1:1 relationship of locks to data items. You can set a single lock to protect more than one data item if that set of memory locations is always accessed together in the same critical region of code.
What if you have a large collection of data, for example, an array of 10,000 items? Protecting the whole array with a single lock will likely lead to a synchronization contention bottleneck. Should a different lock be created for each of the array elements? Even with 32 or 64 threads competing for access, this seems like a waste of memory space to protect against conflicts with a less than 1-in-100 average chance of occurring. Something in between these two extremes is called for, specifically, a set of “modulo locks.” Modulo locks protect every Nth instance of a data collection, where N is the number of locks used. For example, with two locks, one lock protects access to the even numbered elements and the other protects access to the odd-numbered. To access a protected element, a thread computes the modulus of the desired location and then must hold the corresponding modulo lock. The number of locks utilized should be based on the number of threads used and the likelihood of concurrently accessing the same element by two threads.
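A minimal sketch of modulo locks using POSIX mutexes; the array size and the choice of eight locks are arbitrary placeholders, to be tuned as described above:

#include <pthread.h>

#define NUM_ELEMENTS 10000
#define NUM_LOCKS    8      /* tune to thread count and expected contention */

static double data[NUM_ELEMENTS];
static pthread_mutex_t locks[NUM_LOCKS];

void init_locks(void)
{
    for (int i = 0; i < NUM_LOCKS; ++i)
        pthread_mutex_init(&locks[i], NULL);
}

/* Lock k protects every element whose index is congruent to k mod NUM_LOCKS. */
void update_element(int idx, double delta)
{
    pthread_mutex_t *m = &locks[idx % NUM_LOCKS];
    pthread_mutex_lock(m);
    data[idx] += delta;
    pthread_mutex_unlock(m);
}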
However you decide to associate locks with data items, never associate more than one lock with a single data object. Segal's Law states, "A man with a watch knows what time it is. A man with two watches is never sure." If two different locks protect access to the same variable, one part of the code may use one lock for access while another section of code uses the other lock. Threads executing in these two code portions will create a data race, since both will assume they have exclusive access to the contested data.
Rule 8. Don’t be afraid to change the algorithm for a better chance of concurrency.
For comparing performance of applications, both serial and parallel, the bottom line is wall clock execution time. When choosing between two or more algorithms, programmers may rely on the asymptotic order of execution. This theoretical metric will almost always correlate with an application's performance. That is, with everything else held constant, an application that uses an O(n log n) algorithm (e.g., quicksort) will run faster than one using an O(n^2) algorithm (e.g., selection sort) that generates the same results.
In parallel applications, algorithms with a better asymptotic order of execution will run faster, too. Nonetheless, there will be times when the best serial algorithm is not amenable to being parallelized. If a hotspot turns out to be something that can't easily be turned into threaded code (and you can't find a point higher in the hotspot's call stack that can be made parallel), you should consider using a sub-optimal serial algorithm that is easier to parallelize than the better serial algorithm currently in the code. Of course, there may be other, less drastic, changes that will allow you to parallelize a given portion of code.
As an example of this latter point, consider the linear algebra operation for the multiplication of two square matrices. Strassen's Algorithm has one of the best asymptotic orders of execution, O(n^2.81). This is obviously better than the O(n^3) of the traditional triple-nested loop algorithm. Strassen's method divides each of the matrices into four pieces and uses seven recursive calls to multiply the n/2 × n/2 submatrices. To parallelize these recursive calls, you could create a new thread to execute each of the seven independent submatrix multiplications until the submatrices reach a given size. The number of threads will increase exponentially (much like the wives, cats, sacks, etc., coming from St. Ives). As the submatrices get smaller and smaller, the granularity of the work assigned to a newly created thread will get finer and finer. Alternatively, a "pool" of seven threads could be spawned. Each of the initial seven submatrix multiplications would be assigned to one of the threads in the pool for concurrent execution. The thread pool executions proceed by recursively calling Strassen's method to multiply the assigned submatrices, as the serial version does. Systems with more than eight cores will have idle resources under this alternate scheme.
A much easier way to parallelize matrix multiplication is to use the triple-nested loop algorithm. There are several ways to perform data decomposition on the matrices (divide by rows, divide by columns, or divide by blocks) and assign the necessary computations to threads.
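A minimal OpenMP sketch of the row-wise decomposition (the fixed matrix size N is an arbitrary placeholder); each thread owns a disjoint set of rows of the result, so no locking is needed:

#include <omp.h>

#define N 512

void matmul(double A[N][N], double B[N][N], double C[N][N])
{
    #pragma omp parallel for   /* decompose by rows of the result matrix */
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;
            for (int k = 0; k < N; ++k)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
}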
Summary
We've covered 8 Simple Rules that you should keep in mind when designing the threading that will transform a serial application into a parallel version. By following the rules presented here and keeping sound programming practices in mind, you can more easily create concurrent solutions that are more robust, less likely to contain threading problems, and closer to optimal performance, in less time.
Source: Intel® Software Network.
Author: Clay Breshears
Clay is currently a Content Concierge for the Intel Corporation, specializing in multi-core and multithreaded programming and training. He is also the author of "The Art of Concurrency" to be published by O'Reilly Press in May 2009. Clay received his Ph.D. in Computer Science from the University of Tennessee, Knoxville, in 1996, but has been involved with parallel computation and programming for over twenty years; six of those years were spent in academia at Eastern Washington University and The University of Southern Mississippi.