We saw several forms of parallelism that exist at various levels of hardware and software. To fully exploit the parallelism that exists at the machine level (e.g., VLIW, instruction pipelining, superscalar architecture, etc.) we rely heavily on compilers. Sometimes it is possible to rewrite the sources to improve performance on a particular architecture. However, having the compiler transform and optimize the source code based on the architecture is, perhaps, the best long-term solution in this world of growing software lives and shrinking hardware lives. The philosophical point of view is that the users write specifications of their algorithms in a “high-level language” and it is the job of the compiler to ensure that the specification is executed correctly and in the most efficient manner possible on a given hardware platform.
For the rest of this discussion we will work under the assumption of imperative (as against declarative) languages. Moreover, we will assume an Algol-like language, such as C or Fortran. This is a pragmatic approach since a vast majority of parallel applications today are written in such languages. We will call these “high-level” languages to distinguish these from low-level assembly languages.
The most traditional form of compilation is translating a program in a high-level language to machine language or “object” code, which can often be directly linked and executed. This translation is typically carried out using one or more intermediate languages. One could view the compiler as successively lowering the language to the final machine language.
C / Fortran Source → intermediate language → ... → machine language
Parallel programs can be written using the knowledge of the underlying architecture and utilizing synchronization and message passing primitives. We have already seen this when we discussed shared-memory and distributed-memory machines and the different styles of programming them. Such parallel programs are called explicitly parallel. Alternatively, we could imagine the user writing a sequential program in a standard language and letting the compiler automatically discover parallelism in the program and then generating machine code to exploit that parallelism. This is implicit parallelism. Notice that in such cases the user may provide some extra input to the compiler as an aid to parallelization. Can you think of an example of a mechanism that enables users to convey information about code in an implicitly parallel style of programming? For the rest of our discussion we will assume an implicitly parallel model of writing parallel programs. We will call a compiler that compiles a sequential program a scalar compiler. A compiler that also parallelizes the input code is called a parallelizing compiler.
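One possible answer (offered as an illustration only) is a compiler directive that asserts a property the compiler cannot easily prove on its own. For instance, GCC's ivdep pragma tells the compiler that the following loop has no loop-carried dependences; the code itself remains an ordinary sequential C loop. (The function below is a made-up example, not part of the original notes.)

#include <stddef.h>

/* The loop stays sequential in form; the pragma merely asserts a fact the
 * compiler cannot easily prove on its own (here: the pointers do not alias,
 * so there are no loop-carried dependences), freeing it to vectorize or
 * otherwise parallelize the loop. */
void scale(size_t n, double *a, const double *b)
{
#pragma GCC ivdep
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}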
Parallelizing compilers often generate the Single Program Multiple Data (SPMD) style of code. A common technique is to use two phases of compilation. In the first phase the sequential code is translated into the SPMD code through source-to-source translation. In the second phase the “node code” is compiled using a standard scalar compiler. This allows separation of concerns—an engineering approach in compiler design to focus on one aspect of compilation at a time. In addition, the two phases also enable both coarse-grained and fine-grained parallelism to be exploited. For example, the first phase may parallelize loops and the second phase may compile the parallel loops to utilize the memory hierarchy.
Sequential C / Fortran → SPMD C / Fortran → intermediate language → ... → machine language

We will focus on source-to-source transformations.
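To make the two-phase picture concrete, the following is a rough sketch of the kind of SPMD “node code” the first phase might emit for the loop for i = 1:N { A(i) = B(i) + C; }. The names node_code, my_id, and num_procs, and the block partitioning of iterations, are illustrative assumptions rather than a prescribed output format.

/* Every process executes the same program; each picks its own contiguous
 * block of iterations.  my_id and num_procs would come from the runtime
 * (for example, an MPI rank and communicator size). */
void node_code(int my_id, int num_procs, int N,
               double *A, const double *B, double C)
{
    int chunk = (N + num_procs - 1) / num_procs;   /* ceiling division */
    int lo = my_id * chunk;
    int hi = (lo + chunk < N) ? lo + chunk : N;

    for (int i = lo; i < hi; i++)
        A[i] = B[i] + C;
}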
There are many potential ways to transform programs to improve their performance. One widely employed class of transformations is what are called reordering transformations. As the name suggests, these transformations change the order of computations in a program. Reordering transformations are a very powerful class of transformations and can be used at source-level as well as at the intermediate-language level.
Consider a simple implementation of matrix-multiply. (We will use a MATLAB-like syntax.)
for i = 1:N {
  for j = 1:N {
    C(i,j) = 0.0;
    for k = 1:N {
      C(i,j) += A(i,k) * B(k,j);
    }
  }
}

This code works very well on a scalar machine that executes one instruction at a time. However, suppose that we had a machine with a 64-word vector floating-point unit. Clearly the loop in this form does not take advantage of the vector unit. We could rewrite the code as follows.
for i = 1:N {
  for j = 1:64:N {
    C(i,j:j+63) = 0.0;
    for k = 1:N {
      C(i,j:j+63) += A(i,k) * B(k,j:j+63);
    }
  }
}
(The syntax “a:b” in an array index indicates a section of that dimension from element number “a” through “b”. The syntax “L:S:U” in the for loop indicates a loop going from “L” through “U” in steps of “S”.) Notice that multiple columns of C are computed at one time. This involves reordering of computations in the j-loop. How would you reorder the matrix multiplication computation to take advantage of cache (assuming a fully associative cache)?
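One possible answer to the cache question is loop tiling (blocking). The sketch below, written in C with an illustrative block size BS that would have to be tuned to the cache, reorders the computation so that small tiles of A, B, and C are reused many times before being evicted; C is assumed to be zeroed beforehand.

#define BS 64   /* illustrative block size; would be tuned to the cache */

/* One possible cache-oriented reordering: tile all three loops so that the
 * BS-by-BS blocks of A, B, and C being combined fit in cache together. */
void matmul_blocked(int N, double A[N][N], double B[N][N], double C[N][N])
{
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int i = ii; i < ii + BS && i < N; i++)
                    for (int j = jj; j < jj + BS && j < N; j++)
                        for (int k = kk; k < kk + BS && k < N; k++)
                            C[i][j] += A[i][k] * B[k][j];
}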
Dependence analysis provides a way to reason about such reordering transformations. Reordering can be very useful for improving performance on scalar machines as well. We will focus on reordering transformations for parallelization. Formally, a reordering transformation is a transformation that changes only the order of execution of the code, without adding or deleting any executions of any statements.
Since computationally intensive programs are likely to spend most of their time in loops, it makes sense to pay attention to improving the performance of loops by parallelizing them. Recall Amdahl's Law. Do you see an application of Amdahl's Law here? In a 1966 paper, Bernstein established that two iterations I1 and I2 can safely execute in parallel if (1) I1 does not write into a location that is read by I2, (2) I2 does not write into a location that is read by I1, and (3) I1 and I2 do not write into the same location.
This forms the basis for all loop parallelization algorithms.
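As a minimal sketch of how these conditions might be checked, suppose the read and write sets of each iteration have already been collected as arrays of addresses (this representation, and the function names, are assumptions made for illustration):

#include <stdbool.h>
#include <stddef.h>

/* Returns true if the address sets a[0..na) and b[0..nb) share an element. */
static bool intersects(const void **a, size_t na, const void **b, size_t nb)
{
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            if (a[i] == b[j])
                return true;
    return false;
}

/* Bernstein's conditions: iterations I1 and I2 may run in parallel if
 *   (1) I1 writes nothing that I2 reads,
 *   (2) I2 writes nothing that I1 reads, and
 *   (3) I1 and I2 do not write a common location. */
bool bernstein_parallel(const void **r1, size_t nr1, const void **w1, size_t nw1,
                        const void **r2, size_t nr2, const void **w2, size_t nw2)
{
    return !intersects(w1, nw1, r2, nr2) &&
           !intersects(w2, nw2, r1, nr1) &&
           !intersects(w1, nw1, w2, nw2);
}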
Dependence is defined as a relation among statements of a sequential program that constrains the execution order of the statements. If <S1, S2> belongs to the dependence relation then the statement S1 must precede statement S2 in any valid execution of the program.
Dependences represent two kinds of constraints on program transformations.
In the following code, <S1, S3> and <S2, S3> are data dependences.

S1 pi = 3.141;
S2 r = 5.0;
S3 area = pi * r * r;
for ( ... ) {
S1 if (d == 0) continue;
S2 x = x / d;
   ...
}

Here, the statement S1 must be executed before S2, since the execution of S2 is conditional upon the branch in S1: executing S2 before S1 could generate a divide-by-zero error that is impossible in the original code. This is a control dependence.
Graphically the dependence is shown as an edge from S1 to S2 in a dependence graph. In other words the source of a dependence edge must be executed before the sink.
In terms of the orders of reads and writes there are three types of dependences:
True (flow) dependence: statement S1 writes a location that statement S2 later reads. We write S1 δ S2.

S1 X = ...
S2 ... = X
Antidependence: S1 reads a location that S2 later writes. An antidependence is denoted by S1 δ-1 S2.

S1 ... = X
S2 X = ...
Output dependence: S1 and S2 both write the same location. We write S1 δo S2.

S1 X = ...
S2 X = ...
Dependences in straight-line code are easy to define and understand. How will you construct the dependence graph for straight-line code? In a dependence graph each statement is represented as a node and a directed edge goes from a node n1 to n2 if and only if there is a dependence from statement S1 to S2. However, it is the loops that are the primary focus of our attention. For loops, the above definition is not sufficiently descriptive. We need some more machinery to describe dependences in loops more precisely.
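Before adding that machinery, here is a small sketch of the straight-line construction just described. Statements are abstracted to the sets of variable ids they read and write (an illustrative representation), and an edge runs from an earlier statement to a later one whenever a true, anti, or output dependence holds between them.

#include <stdbool.h>

#define MAXV 8   /* illustrative cap on variables read/written per statement */

/* A statement is abstracted to the sets of variables it reads and writes;
 * variables are just small integer ids here. */
struct stmt {
    int reads[MAXV],  nreads;
    int writes[MAXV], nwrites;
};

static bool overlap(const int *a, int na, const int *b, int nb)
{
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            if (a[i] == b[j])
                return true;
    return false;
}

/* dep[i][j] is set to true if statement i must precede statement j, i.e.,
 * if there is a true, anti, or output dependence from S_i to S_j.  Edges
 * always run from the textually earlier statement to the later one. */
void build_dep_graph(const struct stmt *s, int n, bool dep[n][n])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            dep[i][j] = false;

    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            dep[i][j] =
                overlap(s[i].writes, s[i].nwrites, s[j].reads,  s[j].nreads)  ||  /* true   */
                overlap(s[i].reads,  s[i].nreads,  s[j].writes, s[j].nwrites) ||  /* anti   */
                overlap(s[i].writes, s[i].nwrites, s[j].writes, s[j].nwrites);    /* output */
}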
In a loop-nest the nesting level of a loop is defined as one more than the number of loops that enclose it. Thus, the outermost loop has a nesting level 1.
The iteration vector of a particular iteration of a loop nest of depth n is the vector of iteration numbers

i = (i1, i2, ..., in)

where ik, 1 ≤ k ≤ n, represents the iteration number for the loop at nesting level k.
The set of all iteration vectors for a statement is called the iteration space. For example, in the following code
for i = 1:2 {
  for j = 1:2 {
    S
  }
}

the iteration space for S is {(1,1), (1,2), (2,1), (2,2)}.
Since iteration ordering is so important we define a lexicographic order on iterations. Suppose ik denotes the kth element of an iteration vector i. Further, assume that i[1:k] denotes the leftmost k elements of i. Then iteration i precedes iteration j, written i < j, if and only if there is some level k such that i[1:k-1] = j[1:k-1] and ik < jk.
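As a small illustration (the function name and the integer representation of iteration vectors are assumptions), lexicographic comparison of two iteration vectors can be written as:

/* Compares two iteration vectors of length n lexicographically.
 * Returns a negative value if i precedes j, 0 if they are equal, and a
 * positive value if i follows j, mirroring the i < j ordering in the text. */
int lex_compare(const int *i, const int *j, int n)
{
    for (int k = 0; k < n; k++) {
        if (i[k] < j[k]) return -1;
        if (i[k] > j[k]) return  1;
    }
    return 0;
}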
This ordering and the definition of dependence lead us directly to the Loop Dependence Theorem.
Which transformations are “safe” to carry out? Clearly, if the original and the transformed programs pass through exactly the same sequence of states, the two programs are identical. However, this is too restrictive. We consider observational equivalence as our guiding principle: two computations are considered equivalent if, on the same inputs, they produce identical values for the output variables at the time the output statements are executed, and the output statements are executed in the same order. This definition permits reordering of computations and indicates that only those aspects of a computation that concern its output need be preserved. (We will ignore exceptions for this discussion, assuming that the transformations do not introduce any new exceptions that did not exist in the original program.)
This gives us the Fundamental Theorem of Dependence: any reordering transformation that preserves every dependence in a program preserves the meaning of that program. We call a transformation that preserves all the dependences in a program a valid transformation for the program. The next question is: Is the reverse of the above theorem true? In other words, does every meaning-preserving transformation necessarily preserve dependences? The answer is no, as the following example illustrates.
L0 for i = 1:N {
L1   for j = 1:2 {
S0     A(i,j) = A(i,j) + B;
     }
S1   t = A(i,1);
S2   A(i,1) = A(i,2);
S3   A(i,2) = t;
   }
There are dependences from S0 to each of S1, S2, and S3. Thus, dependence preserving transformations cannot move the loop L1 after S3. However, if we do carry out this transformation it still preserves the meaning of the program because all elements of the array A receive the same update and swapping two of the values (which is what statements S1 through S3 do) will not change the final value of A.
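For concreteness, the transformed version described above might look as follows in C (1-based indexing is simulated by ignoring row and column 0; this is only an illustration of the reordering, not code from the original notes):

/* The reordered version of the example above: the L1 loop has been moved
 * after the swap.  This violates the dependences from S0 to S1-S3, yet
 * every element of A still ends up as its old value plus B, because the
 * swap only exchanges two values that receive the same update. */
void reordered(int N, double A[N + 1][3], double B)
{
    for (int i = 1; i <= N; i++) {
        double t = A[i][1];              /* S1 */
        A[i][1]  = A[i][2];              /* S2 */
        A[i][2]  = t;                    /* S3 */
        for (int j = 1; j <= 2; j++)     /* L1, moved after S3 */
            A[i][j] = A[i][j] + B;       /* S0 */
    }
}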
It is convenient to characterize dependences by the distance between the source and the sink of a dependence in the iteration space of the loop nest containing the statements.
If a dependence has source iteration vector i and sink iteration vector j, its distance vector d(i,j) is defined component-wise as d(i,j) = j - i. This leads us to the definition of direction vectors.
D(i,j)k = “<” if d(i,j)k > 0
          “=” if d(i,j)k = 0
          “>” if d(i,j)k < 0
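As a small sketch (assuming the source and sink iteration vectors are available as integer arrays), the distance and direction vectors can be computed component-wise:

/* Computes the distance vector dist = snk - src and the corresponding
 * direction vector, encoded with the characters '<', '=', and '>' exactly
 * as in the definition above.  src is the source iteration vector and snk
 * the sink iteration vector, both of length n. */
void dist_and_dir(const int *src, const int *snk, int n, int *dist, char *dir)
{
    for (int k = 0; k < n; k++) {
        dist[k] = snk[k] - src[k];
        dir[k]  = (dist[k] > 0) ? '<' : (dist[k] == 0) ? '=' : '>';
    }
}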
Consider the following code.
for i = 1:N {
  for j = 1:M {
    for k = 1:L {
S1    A(i+1,j,k-1) = A(i,j,k) + 10;
    }
  }
}

Here, the statement S1 has a true dependence on itself with distance vector (1,0,-1) and direction vector (<,=,>). Notice that for a dependence to exist the leftmost non-“=” component of the direction vector must not be “>”. Why?
Consider a slightly more complicated example:
for j = 1:10 {
  for i = 1:99 {
S1  A(i,j) = B(i,j) + X;
S2  C(i,j) = A(100-i,j) + Y;
  }
}

What are all the dependences here? What are the distance and direction vectors corresponding to each of those dependences? The problem is that, to be precise, dependences exist between dynamic instances of the statements in loops, not between the textual statements. Here, there is a true dependence from S1 to S2 for i-loop iterations 1 through 49, and the dependence distance changes with each iteration. In iteration 50 of the i-loop there is a true dependence with dependence distance zero. Finally, the values read by S2 in i-loop iterations 1 through 49 are overwritten by S1 in iterations 51 through 99, giving antidependences from S2 to S1, again with changing dependence distances.
It is impractical to keep track of all these dependence distances (50 true dependences and 49 antidependences). Fortunately, a lot can be accomplished by keeping track of just the distinct direction vectors. In this case there are only three distinct combinations of dependence type and direction vector: a true dependence with direction (=,<) (i-loop iterations 1 through 49), a true dependence with direction (=,=) (iteration 50), and an antidependence from S2 to S1 with direction (=,<) (whose sinks fall in i-loop iterations 51 through 99).
Notice in the above example that certain dependences span iterations of the loop nest while certain others occur within the same iteration vector of the loop nest. Informally, the former are called loop-carried dependences and the latter are called loop-independent dependences. Thus, in the above example, the two direction vectors (=,<) correspond to loop-carried dependences and the direction vector (=,=) corresponds to the loop-independent dependence. More formally:
A loop-carried dependence from a statement S1 to a statement S2 is forward if S1 occurs before S2 in the loop body. Otherwise, it is called a backward loop-carried dependence. It is also useful to define the level of a loop-carried dependence.
The level of a loop-carried dependence is the index of the leftmost non-“=” component of the direction vector for the dependence. In other words, the level of a dependence is the nesting level of the outermost loop index that varies between the source and the sink. We also define the notion of satisfying a dependence: a dependence is said to be satisfied if transformations that fail to preserve it are precluded. Using this notion, we can state the reordering theorem.
The outcome of the above theorem is that we can satisfy any level-k dependences by not reordering iterations of the level-k loop. This can lead to some powerful transformations. Consider the following code.
for i = 1:10 {
  for j = 1:10 {
    for k = 1:10 {
S     A(i+1,j+2,k+3) = A(i,j,k) + B;
    }
  }
}
What is the direction vector for this loop nest? The only dependence in this example is carried at level 1. Therefore, the code is equivalent to:
for i = 1:10 {
  for k = 10:-1:1 {
    for j = 1:10 {
S     A(i+1,j+2,k+3) = A(i,j,k) + B;
    }
  }
}
This code has been obtained by interchanging the j- and k-loops and reversing the k-loop.
We will use the notation S1 δk S2 to denote a loop-carried dependence carried by the loop level k.
Why do we need two iteration vectors in the definition? This is because we do not want to preclude loop-independent dependences between statements that are part of different loop-nests. Consider the following example.
for i = 1:10 {
S1 A(i) = ...
}
for i = 1:10 {
S2 ... = A(i);
}
Notice that in the case of loop-carried dependences we only worried about common loop nests. Why?
Statements must be prohibited from moving out of their original iteration vectors in order to rule out transformations such as the following. Consider this example.
for i = 1:N {
S1 A(i) = B(i) + C;
S2 D(i) = A(i) + E;
}
If we were allowed to move statements out of their original iteration vectors we could rewrite the code as:
D(1) = A(1) + E;
for i = 2:N {
S1 A(i-1) = B(i-1) + C;
S2 D(i) = A(i) + E;
}
A(N) = B(N) + C;
This preserves all instances of the two statements, but violates the loop-independent dependence.
Vectorization is similar to parallelization in that all the iterations of the loop are executed in parallel. The difference is that we want to vectorize inner loops while we want to parallelize outer loops. Why do we want to vectorize inner loops but parallelize outer loops?
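One way to see this, offered as a hedged illustration using OpenMP directives (which are not part of the example above): the outer loop carries enough work per iteration to amortize the cost of spawning threads, while the inner loop's consecutive, fine-grained iterations map naturally onto vector registers.

/* Coarse-grained parallelism on the outer loop, fine-grained vector
 * parallelism on the inner loop. */
void saxpy2d(int N, int M, float A[N][M], const float B[N][M], float c)
{
    #pragma omp parallel for          /* parallelize: outer loop */
    for (int i = 0; i < N; i++) {
        #pragma omp simd              /* vectorize: inner loop */
        for (int j = 0; j < M; j++)
            A[i][j] = A[i][j] + c * B[i][j];
    }
}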
The following loop
for i = 1:N {
  X(i) = X(i) + C;
}
can be directly vectorized to the following vector statement.
X(1:N) = X(1:N) + C;
However, the following loop cannot.
for i = 1:N {
S X(i+1) = X(i) + C;
}
The reason is that there is a loop-carried dependence from S to itself. Attempting to rewrite this statement as a vector statement
X(2:N+1) = X(1:N) + C;
would be incorrect.
The question is: Can we vectorize any loop that contains loop-carried dependences? Consider the following piece of code.
for i = 1:N {
S1 A(i+1) = B(i) + C;
S2 D(i) = A(i) + E;
}
Clearly, there is a loop-carried dependence here from statement S1 to S2. So, Theorem 6 does not help us. Nevertheless, the code is equivalent to the following vector statements:
S1 A(2:N+1) = B(1:N) + C;
S2 D(1:N) = A(1:N) + E;
Why are the two pieces of code equivalent? In this case vectorization is possible because the loop can be distributed around the two statements
for i = 1:N {
S1 A(i+1) = B(i) + C;
}
for i = 1:N {
S2 D(i) = A(i) + E;
}
each of which can then be vectorized individually. Here, the loop-carried dependence is in the forward direction, but if it ran backwards we could interchange the statements within the loop body as long as there were no loop-independent dependences preventing that interchange (recall Theorem 3 of loop-carried dependences). However, if we have a cycle of dependences, as in
for i = 1:N {
S1 B(i) = A(i) + E;
S2 A(i+1) = B(i) + C;
}
then the loop cannot be vectorized. Here, there is a loop-carried dependence from S2 to S1 and a loop-independent dependence from S1 to S2. This leads us to the following theorem.
procedure vectorize (L, D)
// L is the maximal loop nest containing the statement to be vectorized
// D is the dependence graph for statements in L
  find the set {S1, S2, ..., Sm} of maximal strongly connected regions in the
    dependence graph D restricted to L (e.g., use Tarjan's SCC algorithm);
  construct Lπ from L by reducing each Si to a single node and compute Dπ,
    the dependence graph naturally induced on Lπ by D;
  let {π1, π2, ..., πm} be the m nodes of Lπ numbered in an order consistent with Dπ;
  for i = 1:m {
    if (πi is a dependence cycle) {
      generate a for-loop around the statements in πi;
    } else {
      generate a vector statement vectorized with respect to every loop containing it;
    }
  }
The problem with this simple algorithm is that it fails to vectorize certain cases. Consider the following example:
for i = 1:N {
  for j = 1:M {
S   A(i+1,j) = A(i,j) + B;
  }
}
The statement S has a dependence on itself carried by the i-loop, which prevents the above algorithm from vectorizing it. However, we can satisfy the dependences carried by the i-loop by executing it sequentially and then vectorize the j-loop, as follows:
for i = 1:N {
  A(i+1,1:M) = A(i,1:M) + B;
}
This motivates the following codegen algorithm, which recursively examines the loops of a nest from the outermost level inward and vectorizes the remaining inner loops as soon as the dependence cycles have been broken by executing outer loops sequentially.
procedure codegen (R, k, D)
// R is the region for which we need to generate code
// k is the minimum nesting level of possible parallel loops
// D is the dependence graph among statements in R
  find the set {S1, S2, ..., Sm} of maximal strongly connected regions in the
    dependence graph D restricted to R (e.g., use Tarjan's SCC algorithm);
  construct Rπ from R by reducing each Si to a single node and compute Dπ,
    the dependence graph naturally induced on Rπ by D;
  let {π1, π2, ..., πm} be the m nodes of Rπ numbered in an order consistent with Dπ;
  for i = 1:m {
    if (πi is a dependence cycle) {
      generate a level-k for-loop statement;
      let Di be the dependence graph consisting of all dependence edges in D
        that are at level k+1 or greater and are internal to πi;
      codegen(πi, k+1, Di);
      generate the closing of the level-k for-loop;
    } else {
      generate a vector statement for πi in ρ(πi)-k+1 dimensions,
        where ρ(πi) is the number of loops containing πi;
    }
  }
Randy Allen and Ken Kennedy, Optimizing Compilers for Modern Architectures: A Dependence-Based Approach, Chapter 2. Morgan Kaufmann Publishers, 2002.
B629, Arun Chauhan, Department of Computer Science, Indiana University