Home

You do the science,
We do the cluster

Toll Free 888.237.8058

+1.610.865.6061

Search
Main Menu

BERT Parallelization Models PDF Print E-mail

Most parallelization tools to a good job of determining concurrency in an application. Unfortunately, this is only half the information needed to make an efficient parallelization. How the concurrency is "mapped" onto a specific hardware platform is very important.

The goal, of course, is to keep all the processors busy doing "important work" (not waiting or contributing to overhead). Unfortunately there is no single method that is best for all cases. To provide the most efficient parallelization, BERT 77 supports three parallelization models. Depending on the application, one model may work better than the others. The important issue is that the user has three separate methods at their disposal.

Parallel Efficiency

Parallel efficiency depends on three factors:

  1. degree of parallelism (concurrency)
  2. amount of data transfers
  3. load balance

There are also some computer-dependent parameters like cache, which in the this case are the responsibility of the compiler.

The degree of parallelism (concurrency) is property of program, not the parallelization model. If a program lacks concurrency or has very little concurrency, the parallelization model(and machine) will have little chance of improving performance.

The other two factors, amount of data transfers and load balance depends on the execution model. Although the speed of data transfers is important, it is considered a property of the machine. For a given parallel machine, it is constant and therefore, allows a direct comparison of parallelization models. (NOTE: A change in machine, which may change the communication/computation speed ratio, may also change the efficiency of the model. The appropriateness of each parallelization model must be investigated for each hardware environment. BERT 77 can easily provide this information.)

A Simple Example:

Consider the following concurrent code fragment:

        do k = 1, maxiter
           do i = 1, n
              a(i) = f(a(i), b(i))
           enddo
        enddo
        print *,a

Arrays a and b are "input data" (IN) for the inner DO loop. (i.e. these data are required on any node that will help with the computations.) Array a is also "output data" (OUT). (i.e. The complete a array is required to be on at least one node after the parallel computation.)

The FLOW Model

The FLOW model is a bi-source model. A bi-source model requires that all computations are performed on one processor, unless there is a parallel part. The parallel part is executed on workers. So, for the above code fragment, the FLOW model would look like:

 MASTER PROCESSOR                  
 (source code 1)                           
 ================
  do k = 1, maxiter
   .
   .
   (balance and send portions of previously calculated
    arrays a and b to WORKER processors)
   .
   .
   .
   (collect array a from WORKER processors for 
    future use on MASTER processor)
   .
   .
  enddo
  print *,a


 WORKER PROCESSOR
 (source code 2)
 ================
  (receive my portion of array a and b)
     .
     .
  dynamic pardo i = 1, n
  a(i) = f(a(i), b(i))
     .
     .
  (send my portion of computed array a)
  enddo
  print *,a

In this model, IN data are sent to all WORKER processors, which may result in high communication times. The OUT data are only sent to MASTER processor, which usually requires a small amount of communication time.

The DYNAMIC Model

The DYNAMIC model is a single-source model. All computations are made on all processors. All data objects have correct values on all processors therefore reducing the amount of data transfer. When a parallel part is executed then all "input data" (arrays a and b) are already available. At the end of the loop the "output data" (computed array a) is sent to all processors.

SAME CODE ON ALL PROCESSORS
===========================

      do k = 1, maxiter
         dynamic pardo i = 1, n
            a(i) = f(a(i), b(i))
         enddo
           .
           .
         (collect array a)
         (broadcast array a)
           .
           .
      enddo
      print *,a

In this model IN data are not sent to processors eliminating a certain amount of communication time. Instead, OUT data are broadcast to ALL processors, which may result in high communication times.

Comparative efficiency of these two models (FLOW vs. DYNAMIC) depends on amount of IN and OUT data for each concurrent part of the program. Both methods may use dynamic load balancing to increase performance. In addition, both FLOW and DYNAMIC require that the entire program (code and data) "fit" on at least one of the processors.

The STATIC Model

The STATIC model is similar to the DYNAMIC model, but data transfers are controlled by data distributions. (i.e. each processor operates on a portion of the data, which has been distributed across all processors prior to executing the program. The data distribution is "fixed" and can not be altered at run-time). Each processor performs the same calculations. When parallel portions are executed, each process operates on a specific "slice" of data. Therefore, data transfers typically are not connected with parallel parts of the program, but with how they are used. Consider the example program under the STATIC model:

        do k = 1, maxiter
           static pardo i = my_part_start, my_part_end
              a(i) = f(a(i), b(i))
           enddo
        enddo
        (collect array a)
        (broadcast array a)
        print *,a

For many problems, the STATIC model can significantly reduce data transfers and allow "large" problems to be distributed across the processors (i.e. problems that will not fit on a single processor). The STATIC model can NOT be load balanced at run-time possibly resulting in lower efficiencies. In addition, there is overhead involved in handling array distributions.

What Is The Best Model?

The best model depends largely on the application and how it was written. We have found that for many applications the STATIC MODEL works well, but in other cases the FLOW is much better. Historically, the STATIC model was used for most hand parallelizations. Developing static parallelizations by-hand is very difficult and time consuming due to the complexities of data layout. Indeed, the programmer can not be sure how well the parallelization will work until the code is written and tested.

Attempting to test additional models by-hand is very time consuming. BERT provides a very fast method for performing this type of analysis without re-programming. In addition, BERT allows the user to test different FORTRAN compilers and communication libraries (PVM vs. MPI)

Last Updated ( Monday, 17 July 2006 )
 
 
 
  © 2008 Seagrove LLC, All rights reserved. Basement Supercomputing and the Basement Supercomputing Logo are trademarks of Seagrove LLC.