We saw several forms of parallelism that exist at various levels of hardware and software. To fully exploit the parallelism that exists at the machine level (e.g., VLIW, instruction pipelining, superscalar architecture, etc.) we rely heavily on compilers. Sometimes it is possible to rewrite the sources to improve performance on a particular architecture. However, having the compiler transform and optimize the source code based on the architecture is, perhaps, the best long-term solution in this world of growing software lives and shrinking hardware lives. The philosophical point of view is that the users write specifications of their algorithms in a “high-level language” and it is the job of the compiler to ensure that the specification is executed correctly and in the most efficient manner possible on a given hardware platform.
For the rest of this discussion we will work under the assumption of imperative (as against declarative) languages. Moreover, we will assume an Algol-like language, such as C or Fortran. This is a pragmatic approach since a vast majority of parallel applications today are written in such languages. We will call these “high-level” languages to distinguish these from low-level assembly languages.
The most traditional form of compilation is translating a program in a high-level language to machine language or “object” code, which can often be directly linked and executed. This translation is typically carried out using one or more intermediate languages. One could view the compiler as successively lowering the language to the final machine language.
C / Fortran Source → intermediate language → ... → machine language
Parallel programs can be written using the knowledge of the underlying architecture and utilizing synchronization and message passing primitives. We have already seen this when we discussed shared-memory and distributed-memory machines and the different styles of programming them. Such parallel programs are called explicitly parallel. Alternatively, we could imagine the user writing a sequential program in a standard language and letting the compiler automatically discover parallelism in the program and then generating machine code to exploit that parallelism. This is implicit parallelism. Notice that in such cases the user may provide some extra input to the compiler as an aid to parallelization. Can you think of an example of a mechanism that enables users to convey information about code in an implicitly parallel style of programming? For the rest of our discussion we will assume an implicitly parallel model of writing parallel programs. We will call a compiler that compiles a sequential program a scalar compiler. A compiler that also parallelizes the input code is called a parallelizing compiler.
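One possible answer (offered as an illustration only) is a compiler directive that asserts a property the compiler cannot easily prove on its own. For instance, GCC's ivdep pragma tells the compiler that the following loop has no loop-carried dependences; the code itself remains an ordinary sequential C loop. (The function below is a made-up example, not part of the original notes.)

#include <stddef.h>

/* The loop stays sequential in form; the pragma merely asserts a fact the
 * compiler cannot easily prove on its own (here: the pointers do not alias,
 * so there are no loop-carried dependences), freeing it to vectorize or
 * otherwise parallelize the loop. */
void scale(size_t n, double *a, const double *b)
{
#pragma GCC ivdep
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}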
Parallelizing compilers often generate the Single Program Multiple Data (SPMD) style of code. A common technique is to use two phases of compilation. In the first phase the sequential code is translated into the SPMD code through source-to-source translation. In the second phase the “node code” is compiled using a standard scalar compiler. This allows separation of concerns—an engineering approach in compiler design to focus on one aspect of compilation at a time. In addition, the two phases also enable both coarse-grained and fine-grained parallelism to be exploited. For example, the first phase may parallelize loops and the second phase may compile the parallel loops to utilize the memory hierarchy.
Sequential C / Fortran → SPMD C / Fortran → intermediate language → ... → machine language

We will focus on source-to-source transformations.
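To make the two-phase picture concrete, the following is a rough sketch of the kind of SPMD “node code” the first phase might emit for the loop for i = 1:N { A(i) = B(i) + C; }. The names node_code, my_id, and num_procs, and the block partitioning of iterations, are illustrative assumptions rather than a prescribed output format.

/* Every process executes the same program; each picks its own contiguous
 * block of iterations.  my_id and num_procs would come from the runtime
 * (for example, an MPI rank and communicator size). */
void node_code(int my_id, int num_procs, int N,
               double *A, const double *B, double C)
{
    int chunk = (N + num_procs - 1) / num_procs;   /* ceiling division */
    int lo = my_id * chunk;
    int hi = (lo + chunk < N) ? lo + chunk : N;

    for (int i = lo; i < hi; i++)
        A[i] = B[i] + C;
}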
There are many potential ways to transform programs to improve their performance. One widely employed class of transformations is what are called reordering transformations. As the name suggests, these transformations change the order of computations in a program. Reordering transformations are a very powerful class of transformations and can be used at source-level as well as at the intermediate-language level.
Consider a simple implementation of matrix-multiply. (We will use a MATLAB-like syntax.)
for i = 1:N {
  for j = 1:N {
    C(i,j) = 0.0;
    for k = 1:N {
      C(i,j) += A(i,k) * B(k,j);
    }
  }
}

This code works very well on a scalar machine that executes one instruction at a time. However, suppose that we had a machine with a 64-word vector floating-point unit. Clearly the loop in this form does not take advantage of the vector unit. We could rewrite the code as follows.
for i = 1:N {
  for j = 1:64:N {
    C(i,j:j+63) = 0.0;
    for k = 1:N {
      C(i,j:j+63) += A(i,k) * B(k,j:j+63);
    }
  }
}
(The syntax “a:b” in an array index indicates a section of that dimension from element number “a” through “b”. The syntax “L:S:U” in the for loop indicates a loop going from “L” through “U” in steps of “S”.) Notice that multiple columns of C are computed at one time. This involves reordering of computations in the j-loop. How would you reorder the matrix multiplication computation to take advantage of cache (assuming a fully associative cache)?
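One possible answer to the cache question is loop tiling (blocking). The sketch below, written in C with an illustrative block size BS that would have to be tuned to the cache, reorders the computation so that small tiles of A, B, and C are reused many times before being evicted; C is assumed to be zeroed beforehand.

#define BS 64   /* illustrative block size; would be tuned to the cache */

/* One possible cache-oriented reordering: tile all three loops so that the
 * BS-by-BS blocks of A, B, and C being combined fit in cache together. */
void matmul_blocked(int N, double A[N][N], double B[N][N], double C[N][N])
{
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int i = ii; i < ii + BS && i < N; i++)
                    for (int j = jj; j < jj + BS && j < N; j++)
                        for (int k = kk; k < kk + BS && k < N; k++)
                            C[i][j] += A[i][k] * B[k][j];
}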
Dependence analysis provides a way to reason about such reordering transformations. Reordering can be very useful for improving performance on scalar machines as well. We will focus on reordering transformations for parallelization. Formally, a reordering transformation is a transformation that changes only the order of execution of the code, without adding or deleting any executions of any statements.
Since computationally intensive programs are likely to spend most of their time in loops, it makes sense to pay attention to improving the performance of loops by parallelizing them. Recall Amdahl's Law. Do you see an application of Amdahl's Law here? In a 1966 paper, Bernstein established that two iterations I1 and I2 can safely execute in parallel if (1) I1 does not write into a location that is read by I2, (2) I2 does not write into a location that is read by I1, and (3) I1 and I2 do not write into the same location.
This forms the basis for all loop parallelization algorithms.
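As a minimal sketch of how these conditions might be checked, suppose the read and write sets of each iteration have already been collected as arrays of addresses (this representation, and the function names, are assumptions made for illustration):

#include <stdbool.h>
#include <stddef.h>

/* Returns true if the address sets a[0..na) and b[0..nb) share an element. */
static bool intersects(const void **a, size_t na, const void **b, size_t nb)
{
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            if (a[i] == b[j])
                return true;
    return false;
}

/* Bernstein's conditions: iterations I1 and I2 may run in parallel if
 *   (1) I1 writes nothing that I2 reads,
 *   (2) I2 writes nothing that I1 reads, and
 *   (3) I1 and I2 do not write a common location. */
bool bernstein_parallel(const void **r1, size_t nr1, const void **w1, size_t nw1,
                        const void **r2, size_t nr2, const void **w2, size_t nw2)
{
    return !intersects(w1, nw1, r2, nr2) &&
           !intersects(w2, nw2, r1, nr1) &&
           !intersects(w1, nw1, w2, nw2);
}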
Dependence is defined as a relation among statements of a sequential program that constrains the execution order of the statements. If <S1, S2> belongs to the dependence relation then the statement S1 must precede statement S2 in any valid execution of the program.
Dependences represent two kinds of constraints on program transformations.
In the following code, <S1, S3> and <S2, S3> are data dependences.

S1 pi = 3.141;
S2 r = 5.0;
S3 area = pi * r * r;
for ( ... ) {
S1 if (d == 0) continue;
S2 x = x / d;
   ...
}

Here, the statement S1 must be executed before S2, since the execution of S2 is conditional upon the branch in S1: executing S2 before S1 could generate a divide-by-zero error that is impossible in the original code. This is a control dependence.
Graphically the dependence is shown as an edge from S1 to S2 in a dependence graph. In other words the source of a dependence edge must be executed before the sink.
In terms of the orders of reads and writes there are three types of dependences:
True (flow) dependence: statement S1 writes a location that statement S2 later reads. We write S1 δ S2.

S1 X = ...
S2 ... = X
Antidependence: S1 reads a location that S2 later writes. An antidependence is denoted by S1 δ-1 S2.

S1 ... = X
S2 X = ...
Output dependence: S1 and S2 both write the same location. We write S1 δo S2.

S1 X = ...
S2 X = ...
Dependences in straight-line code are easy to define and understand. How will you construct the dependence graph for straight-line code? In a dependence graph each statement is represented as a node and a directed edge goes from a node n1 to n2 if and only if there is a dependence from statement S1 to S2. However, it is the loops that are the primary focus of our attention. For loops, the above definition is not sufficiently descriptive. We need some more machinery to describe dependences in loops more precisely.
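Before adding that machinery, here is a small sketch of the straight-line construction just described. Statements are abstracted to the sets of variable ids they read and write (an illustrative representation), and an edge runs from an earlier statement to a later one whenever a true, anti, or output dependence holds between them.

#include <stdbool.h>

#define MAXV 8   /* illustrative cap on variables read/written per statement */

/* A statement is abstracted to the sets of variables it reads and writes;
 * variables are just small integer ids here. */
struct stmt {
    int reads[MAXV],  nreads;
    int writes[MAXV], nwrites;
};

static bool overlap(const int *a, int na, const int *b, int nb)
{
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            if (a[i] == b[j])
                return true;
    return false;
}

/* dep[i][j] is set to true if statement i must precede statement j, i.e.,
 * if there is a true, anti, or output dependence from S_i to S_j.  Edges
 * always run from the textually earlier statement to the later one. */
void build_dep_graph(const struct stmt *s, int n, bool dep[n][n])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            dep[i][j] = false;

    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            dep[i][j] =
                overlap(s[i].writes, s[i].nwrites, s[j].reads,  s[j].nreads)  ||  /* true   */
                overlap(s[i].reads,  s[i].nreads,  s[j].writes, s[j].nwrites) ||  /* anti   */
                overlap(s[i].writes, s[i].nwrites, s[j].writes, s[j].nwrites);    /* output */
}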
In a loop-nest the nesting level of a loop is defined as one more than the number of loops that enclose it. Thus, the outermost loop has a nesting level 1.
The iteration vector of a particular iteration of a loop nest of depth n is the vector of iteration numbers

i = (i1, i2, ..., in)

where ik, 1 ≤ k ≤ n, represents the iteration number for the loop at nesting level k.
The set of all iteration vectors for a statement is called the iteration space. For example, in the following code
for i = 1:2 {
  for j = 1:2 {
    S
  }
}

the iteration space for S is {(1,1), (1,2), (2,1), (2,2)}.
Since iteration ordering is so important we define a lexicographic order on iterations. Suppose ik denotes the kth element of an iteration vector i. Further, assume that i[1:k] denotes the leftmost k elements of i. Then iteration i precedes iteration j, written i < j, if and only if there is some level k such that i[1:k-1] = j[1:k-1] and ik < jk.
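As a small illustration (the function name and the integer representation of iteration vectors are assumptions), lexicographic comparison of two iteration vectors can be written as:

/* Compares two iteration vectors of length n lexicographically.
 * Returns a negative value if i precedes j, 0 if they are equal, and a
 * positive value if i follows j, mirroring the i < j ordering in the text. */
int lex_compare(const int *i, const int *j, int n)
{
    for (int k = 0; k < n; k++) {
        if (i[k] < j[k]) return -1;
        if (i[k] > j[k]) return  1;
    }
    return 0;
}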
This ordering and the definition of dependence lead us directly to the Loop Dependence Theorem.
Which transformations are “safe” to carry out? Clearly, if the original and the transformed programs pass through exactly the same sequence of states, the two programs are identical. However, this is too restrictive. We consider observational equivalence as our guiding principle: two computations are considered equivalent if, on the same inputs, they produce identical values for the output variables at the time the output statements are executed, and the output statements are executed in the same order. This definition permits reordering of computations and indicates that only those aspects of a computation that concern its output need be preserved. (We will ignore exceptions for this discussion, assuming that the transformations do not introduce any new exceptions that did not exist in the original program.)
This gives us the Fundamental Theorem of Dependence: any reordering transformation that preserves every dependence in a program preserves the meaning of that program. We call a transformation that preserves all the dependences in a program a valid transformation for the program. The next question is: Is the reverse of the above theorem true? In other words, does every meaning-preserving transformation necessarily preserve dependences? The answer is no, as the following example illustrates.
L0 for i = 1:N {
L1   for j = 1:2 {
S0     A(i,j) = A(i,j) + B;
     }
S1   t = A(i,1);
S2   A(i,1) = A(i,2);
S3   A(i,2) = t;
   }
There are dependences from S0 to each of S1, S2, and S3. Thus, dependence preserving transformations cannot move the loop L1 after S3. However, if we do carry out this transformation it still preserves the meaning of the program because all elements of the array A receive the same update and swapping two of the values (which is what statements S1 through S3 do) will not change the final value of A.
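For concreteness, the transformed version described above might look as follows in C (1-based indexing is simulated by ignoring row and column 0; this is only an illustration of the reordering, not code from the original notes):

/* The reordered version of the example above: the L1 loop has been moved
 * after the swap.  This violates the dependences from S0 to S1-S3, yet
 * every element of A still ends up as its old value plus B, because the
 * swap only exchanges two values that receive the same update. */
void reordered(int N, double A[N + 1][3], double B)
{
    for (int i = 1; i <= N; i++) {
        double t = A[i][1];              /* S1 */
        A[i][1]  = A[i][2];              /* S2 */
        A[i][2]  = t;                    /* S3 */
        for (int j = 1; j <= 2; j++)     /* L1, moved after S3 */
            A[i][j] = A[i][j] + B;       /* S0 */
    }
}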
It is convenient to characterize dependences by the distance between the source and the sink of a dependence in the iteration space of the loop nest containing the statements.
If a dependence has source iteration vector i and sink iteration vector j, its distance vector d(i,j) is defined component-wise as d(i,j) = j - i. This leads us to the definition of direction vectors.
D(i,j)k = “<” if d(i,j)k > 0
          “=” if d(i,j)k = 0
          “>” if d(i,j)k < 0
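As a small sketch (assuming the source and sink iteration vectors are available as integer arrays), the distance and direction vectors can be computed component-wise:

/* Computes the distance vector dist = snk - src and the corresponding
 * direction vector, encoded with the characters '<', '=', and '>' exactly
 * as in the definition above.  src is the source iteration vector and snk
 * the sink iteration vector, both of length n. */
void dist_and_dir(const int *src, const int *snk, int n, int *dist, char *dir)
{
    for (int k = 0; k < n; k++) {
        dist[k] = snk[k] - src[k];
        dir[k]  = (dist[k] > 0) ? '<' : (dist[k] == 0) ? '=' : '>';
    }
}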
Consider the following code.
for i = 1:N {
  for j = 1:M {
    for k = 1:L {
S1    A(i+1,j,k-1) = A(i,j,k) + 10;
    }
  }
}

Here, the statement S1 has a true dependence on itself with distance vector (1,0,-1) and direction vector (<,=,>). Notice that for a dependence to exist the leftmost non-“=” component of the direction vector must not be “>”. Why?
Consider a slightly more complicated example:
for j = 1:10 {
  for i = 1:99 {
S1  A(i,j) = B(i,j) + X;
S2  C(i,j) = A(100-i,j) + Y;
  }
}

What are all the dependences here? What are the distance and direction vectors corresponding to each of those dependences? The problem is that, to be precise, dependences exist between dynamic instances of the statements in loops, not between the textual statements. Here, there is a true dependence from S1 to S2 for i-loop iterations 1 through 49, and the dependence distance changes with each iteration. In iteration 50 of the i-loop there is a true dependence with dependence distance zero. Finally, the values read by S2 in i-loop iterations 1 through 49 are overwritten by S1 in iterations 51 through 99, giving antidependences from S2 to S1, again with changing dependence distances.
It is impractical to keep track of all these dependence distances (50 true dependences and 49 antidependences). Fortunately, a lot can be accomplished by keeping track of just the distinct direction vectors. In this case there are only three distinct combinations of dependence type and direction vector: a true dependence with direction (=,<) (i-loop iterations 1 through 49), a true dependence with direction (=,=) (iteration 50), and an antidependence from S2 to S1 with direction (=,<) (whose sinks fall in i-loop iterations 51 through 99).
Notice in the above example that certain dependences span iterations of the loop nest while certain others occur within the same iteration vector of the loop nest. Informally, the former are called loop-carried dependences and the latter are called loop-independent dependences. Thus, in the above example, the two direction vectors (=,<) correspond to loop-carried dependences and the direction vector (=,=) corresponds to the loop-independent dependence. More formally:
A loop-carried dependence from a statement S1 to a statement S2 is forward if S1 occurs before S2 in the loop body. Otherwise, it is called a backward loop-carried dependence. It is also useful to define the level of a loop-carried dependence.
The level of a loop-carried dependence is the index of the leftmost non-“=” component of the direction vector for the dependence. In other words, the level of a dependence is the nesting level of the outermost loop index that varies between the source and the sink. We also define the notion of satisfying a dependence: a dependence is said to be satisfied if transformations that fail to preserve it are precluded. Using this notion, we can state the reordering theorem.
The outcome of the above theorem is that we can satisfy any level-k dependences by not reordering iterations of the level-k loop. This can lead to some powerful transformations. Consider the following code.
for i = 1:10 {
  for j = 1:10 {
    for k = 1:10 {
S     A(i+1,j+2,k+3) = A(i,j,k) + B;
    }
  }
}
What is the direction vector for this loop nest? The only dependence in this example is carried at level 1. Therefore, the code is equivalent to:
for i = 1:10 {
  for k = 10:-1:1 {
    for j = 1:10 {
S     A(i+1,j+2,k+3) = A(i,j,k) + B;
    }
  }
}
This code has been obtained by interchanging the j- and k-loops and reversing the k-loop.
We will use the notation S1 δk S2 to denote a loop-carried dependence carried by the loop level k.
Why do we need two iteration vectors in the definition? This is because we do not want to preclude loop-independent dependences between statements that are part of different loop-nests. Consider the following example.
for i = 1:10 {
S1 A(i) = ...
}
for i = 1:10 {
S2 ... = A(i);
}
Notice that in the case of loop-carried dependences we only worried about common loop nests. Why?
Statements must be prohibited from moving out of their original iteration vectors in order to rule out transformations such as the following. Consider this example.
for i = 1:N {
S1 A(i) = B(i) + C;
S2 D(i) = A(i) + E;
}
If we were allowed to move statements out of their original iteration vectors we could rewrite the code as:
D(1) = A(1) + E;
for i = 2:N {
S1 A(i-1) = B(i-1) + C;
S2 D(i) = A(i) + E;
}
A(N) = B(N) + C;
This preserves all instances of the two statements, but violates the loop-independent dependence.
Vectorization is similar to parallelization in that all the iterations of the loop are executed in parallel. The difference is that we want to vectorize inner loops while we want to parallelize outer loops. Why do we want to vectorize inner loops but parallelize outer loops?
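One way to see this, offered as a hedged illustration using OpenMP directives (which are not part of the example above): the outer loop carries enough work per iteration to amortize the cost of spawning threads, while the inner loop's consecutive, fine-grained iterations map naturally onto vector registers.

/* Coarse-grained parallelism on the outer loop, fine-grained vector
 * parallelism on the inner loop. */
void saxpy2d(int N, int M, float A[N][M], const float B[N][M], float c)
{
    #pragma omp parallel for          /* parallelize: outer loop */
    for (int i = 0; i < N; i++) {
        #pragma omp simd              /* vectorize: inner loop */
        for (int j = 0; j < M; j++)
            A[i][j] = A[i][j] + c * B[i][j];
    }
}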
The following loop
for i = 1:N {
  X(i) = X(i) + C;
}
can be directly vectorized to the following vector statement.
X(1:N) = X(1:N) + C;
However, the following loop cannot.
for i = 1:N {
S X(i+1) = X(i) + C;
}
The reason is that there is a loop-carried dependence from S to itself. Attempting to rewrite this statement as a vector statement
X(2:N+1) = X(1:N) + C;
would be incorrect.
The question is: Can we vectorize any loop that contains loop-carried dependences? Consider the following piece of code.
for i = 1:N {
S1 A(i+1) = B(i) + C;
S2 D(i) = A(i) + E;
}
Clearly, there is a loop-carried dependence here from statement S1 to S2. So, Theorem 6 does not help us. Nevertheless, the code is equivalent to the following vector statements:
S1 A(2:N+1) = B(1:N) + C;
S2 D(1:N) = A(1:N) + E;
Why are the two pieces of code equivalent? In this case vectorization is possible because the loop can be distributed around the two statements
for i = 1:N {
S1 A(i+1) = B(i) + C;
}
for i = 1:N {
S2 D(i) = A(i) + E;
}
each of which can then be vectorized individually. Here, the loop-carried dependence is in the forward direction, but if it ran backwards we could interchange the statements within the loop body as long as there were no loop-independent dependences preventing that interchange (recall Theorem 3 of loop-carried dependences). However, if we have a cycle of dependences, as in
for i = 1:N {
S1 B(i) = A(i) + E;
S2 A(i+1) = B(i) + C;
}
then the loop cannot be vectorized. Here, there is a loop-carried dependence from S2 to S1 and a loop-independent dependence from S1 to S2. This leads us to the following theorem.
procedure vectorize (L, D)
// L is the maximal loop nest containing the statement to be vectorized
// D is the dependence graph for statements in L
  find the set {S1, S2, ..., Sm} of maximal strongly connected regions in the
    dependence graph D restricted to L (e.g., use Tarjan's SCC algorithm);
  construct Lπ from L by reducing each Si to a single node and compute Dπ,
    the dependence graph naturally induced on Lπ by D;
  let {π1, π2, ..., πm} be the m nodes of Lπ numbered in an order consistent with Dπ;
  for i = 1:m {
    if (πi is a dependence cycle) {
      generate a for-loop around the statements in πi;
    } else {
      generate a vector statement vectorized with respect to every loop containing it;
    }
  }
The problem with this simple algorithm is that it fails to vectorize certain cases. Consider the following example:
for i = 1:N {
  for j = 1:M {
S   A(i+1,j) = A(i,j) + B;
  }
}
The statement S has a dependence on itself carried by the i-loop, which prevents the above algorithm from vectorizing it. However, we can satisfy the dependences carried by the i-loop by executing it sequentially and then vectorize the j-loop, as follows:
for i = 1:N {
  A(i+1,1:M) = A(i,1:M) + B;
}
This motivates the following codegen algorithm, which recursively examines the loops of a nest from the outermost level inward and vectorizes the remaining inner loops as soon as the dependence cycles have been broken by executing outer loops sequentially.
procedure codegen (R, k, D)
// R is the region for which we need to generate code
// k is the minimum nesting level of possible parallel loops
// D is the dependence graph among statements in R
  find the set {S1, S2, ..., Sm} of maximal strongly connected regions in the
    dependence graph D restricted to R (e.g., use Tarjan's SCC algorithm);
  construct Rπ from R by reducing each Si to a single node and compute Dπ,
    the dependence graph naturally induced on Rπ by D;
  let {π1, π2, ..., πm} be the m nodes of Rπ numbered in an order consistent with Dπ;
  for i = 1:m {
    if (πi is a dependence cycle) {
      generate a level-k for-loop statement;
      let Di be the dependence graph consisting of all dependence edges in D
        that are at level k+1 or greater and are internal to πi;
      codegen(πi, k+1, Di);
      generate the closing of the level-k for-loop;
    } else {
      generate a vector statement for πi in ρ(πi)-k+1 dimensions,
        where ρ(πi) is the number of loops containing πi;
    }
  }
Randy Allen and Ken Kennedy, Optimizing Compilers for Modern Architectures: A Dependence-Based Approach, Chapter 2. Morgan Kaufmann Publishers, 2002.
B629, Arun Chauhan, Department of Computer Science, Indiana University