|
Most parallelization tools to a good job of determining concurrency
in an application. Unfortunately, this is only half the information
needed to make an efficient parallelization. How the concurrency is
"mapped" onto a specific hardware platform is very important.
The goal, of course, is to keep all the processors busy doing
"important work" (not waiting or contributing to overhead).
Unfortunately there is no single method that is best for all cases. To
provide the most efficient parallelization, BERT 77 supports three
parallelization models. Depending on the application, one model may
work better than the others. The important issue is that the user has
three separate methods at their disposal.
Parallel Efficiency
Parallel efficiency depends on three factors:
- degree of parallelism (concurrency)
- amount of data transfers
- load balance
There are also some computer-dependent parameters like cache, which
in the this case are the responsibility of the compiler.
The degree of parallelism (concurrency) is property of program, not
the parallelization model. If a program lacks concurrency or has very
little concurrency, the parallelization model(and machine) will have
little chance of improving performance.
The other two factors, amount of data transfers and load balance
depends on the execution model. Although the speed of data transfers is
important, it is considered a property of the machine. For a given
parallel machine, it is constant and therefore, allows a direct
comparison of parallelization models. (NOTE: A change in machine, which
may change the communication/computation speed ratio, may also change
the efficiency of the model. The appropriateness of each
parallelization model must be investigated for each hardware
environment. BERT 77 can easily provide this information.)
A Simple Example:
Consider the following concurrent code fragment:
do k = 1, maxiter
do i = 1, n
a(i) = f(a(i), b(i))
enddo
enddo
print *,a
Arrays a and b are "input data" (IN) for the inner
DO loop. (i.e. these data are required on any node that will help with
the computations.) Array a is also "output data" (OUT). (i.e.
The complete a array is required to be on at least one node
after the parallel computation.)
The FLOW Model
The FLOW model is a bi-source model. A bi-source model requires that
all computations are performed on one processor, unless there is a
parallel part. The parallel part is executed on workers. So, for the
above code fragment, the FLOW model would look like:
MASTER PROCESSOR
(source code 1)
================
do k = 1, maxiter
.
.
(balance and send portions of previously calculated
arrays a and b to WORKER processors)
.
.
.
(collect array a from WORKER processors for
future use on MASTER processor)
.
.
enddo
print *,a
WORKER PROCESSOR
(source code 2)
================
(receive my portion of array a and b)
.
.
dynamic pardo i = 1, n
a(i) = f(a(i), b(i))
.
.
(send my portion of computed array a)
enddo
print *,a
In this model, IN data are sent to all WORKER processors, which may
result in high communication times. The OUT data are only sent to
MASTER processor, which usually requires a small amount of communication
time.
The DYNAMIC Model
The DYNAMIC model is a single-source model. All computations are
made on all processors. All data objects have correct values on all
processors therefore reducing the amount of data transfer. When a
parallel part is executed then all "input data" (arrays a and
b) are already available. At the end of the loop the "output
data" (computed array a) is sent to all processors.
SAME CODE ON ALL PROCESSORS
===========================
do k = 1, maxiter
dynamic pardo i = 1, n
a(i) = f(a(i), b(i))
enddo
.
.
(collect array a)
(broadcast array a)
.
.
enddo
print *,a
In this model IN data are not sent to processors eliminating a
certain amount of communication time. Instead, OUT data are broadcast to
ALL processors, which may result in high communication times.
Comparative efficiency of these two models (FLOW vs. DYNAMIC)
depends on amount of IN and OUT data for each concurrent part of the
program. Both methods may use dynamic load balancing to increase
performance. In addition, both FLOW and DYNAMIC require that the entire
program (code and data) "fit" on at least one of the processors.
The STATIC Model
The STATIC model is similar to the DYNAMIC model, but data transfers
are controlled by data distributions. (i.e. each processor operates on a
portion of the data, which has been distributed across all processors
prior to executing the program. The data distribution is "fixed" and
can not be altered at run-time). Each processor performs the same
calculations. When parallel portions are executed, each process
operates on a specific "slice" of data. Therefore, data transfers
typically are not connected with parallel parts of the program, but with
how they are used. Consider the example program under the STATIC model:
do k = 1, maxiter
static pardo i = my_part_start, my_part_end
a(i) = f(a(i), b(i))
enddo
enddo
(collect array a)
(broadcast array a)
print *,a
For many problems, the STATIC model can significantly reduce data
transfers and allow "large" problems to be distributed across the
processors (i.e. problems that will not fit on a single processor). The
STATIC model can NOT be load balanced at run-time possibly resulting in
lower efficiencies. In addition, there is overhead involved in handling
array distributions.
What Is The Best Model?
The best model depends largely on the application and how it was
written. We have found that for many applications the STATIC MODEL
works well, but in other cases the FLOW is much better. Historically,
the STATIC model was used for most hand parallelizations. Developing
static parallelizations by-hand is very difficult and time consuming due
to the complexities of data layout. Indeed, the programmer can not be
sure how well the parallelization will work until the code is written
and tested.
Attempting to test additional models by-hand is very time consuming.
BERT provides a very fast method for performing this type of analysis
without re-programming. In addition, BERT allows the user to test
different FORTRAN compilers and communication libraries (PVM vs. MPI)
|