

Alexandre David B2-206



- Introduction to Parallel Algorithms (Sven Skyum)
  - PRAM model
  - Optimality
  - Examples
- Physical Organization of Parallel Platforms (2.4)



#### Standard RAM Model

- Standard Random Access Machine:
  - Each operation (load, store, jump, add, etc.) takes one unit of time.
- Simple; generally a single model is enough for sequential machines.



#### Multi-processor Machines

- Numerous architectures $\rightarrow$ different models.
- Differences in communication:
  - Synchronous
  - Asynchronous
- Differences in memory layout:
  - NUMA (non-uniform memory access)
  - UMA (uniform memory access)

## PRAM Model

- A PRAM consists of
  - a globally accessible (i.e. shared) memory,
  - a set of processors, usually (though not always) running the same program, each with a private stack.
- A PRAM is synchronous.
- Resources (processors and memory) are unlimited.

### Classes of PRAM

- How to resolve contention?
  - EREW PRAM: exclusive read, exclusive write
  - CREW PRAM: concurrent read, exclusive write
  - ERCW PRAM: exclusive read, concurrent write
  - CRCW PRAM: concurrent read, concurrent write
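To make the four classes concrete, here is a minimal Python sketch that looks at the memory addresses touched in one synchronous step and reports the weakest class that permits it. The function name `weakest_pram_class` and the representation of a step as two address lists are assumptions made for illustration; a read and a write to the same cell in one step are not considered.

```python
from collections import Counter

def weakest_pram_class(reads, writes):
    """Return the weakest PRAM class that allows one synchronous step.

    reads / writes: lists of memory addresses, one entry per processor
    that reads (or writes) in this step.
    """
    # An access is "concurrent" if some address appears more than once.
    concurrent_read = any(count > 1 for count in Counter(reads).values())
    concurrent_write = any(count > 1 for count in Counter(writes).values())
    read_part = "CR" if concurrent_read else "ER"
    write_part = "CW" if concurrent_write else "EW"
    return read_part + write_part + " PRAM"

if __name__ == "__main__":
    print(weakest_pram_class([1, 2, 3], [4, 5]))   # all addresses distinct
    print(weakest_pram_class([1, 1], [3, 3]))      # both kinds of contention
```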

### Example: Sequential Max

```
function smax(A, n)
    m := -∞
    for i := 1 to n do
        m := max{m, A[i]}
    od
    smax := m
end
```

Time O(n)

#### Example: Sequential Max (bis)

```
function smax2(A, n)
    for i := 1 to n/2 do
        B[i] := max{A[2i-1], A[2i]}
    od
    if n = 2 then
        smax2 := B[1]
    else
        smax2 := smax2(B, n/2)
    fi
end
```

Time O(n) (n is assumed to be a power of 2)

#### Example: Parallel Max

```
function smax2(A, n) [p_1, p_2, ..., p_{n/2}]
    for i := 1 to n/2 pardo
        p_i: B[i] := max{A[2i-1], A[2i]}
    od
    if n = 2 then
        p_1: smax2 := B[1]
    else
        smax2 := smax2(B, n/2) [p_1, p_2, ..., p_{n/4}]
    fi
end
```

Time O(log n)

#### Analysis of the Parallel Max

- Time: $O(\log n)$ using $n/2$ processors.
- Work done?
  - $p(n) = n/2$: number of processors.
  - $t(n)$: time to run the algorithm.
  - $w(n) = p(n) \cdot t(n)$: work done. Here $w(n) = O(n \log n)$.
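The gap between the work $w(n)$ and the operations actually performed can be seen by simulating the rounds sequentially. The following is a sketch only: each list comprehension stands for one pardo step, and the name `parallel_max` and the returned tuple are illustrative choices, not part of the slides.

```python
def parallel_max(values):
    """Simulate the pairwise-max rounds; len(values) must be a power of 2, >= 2."""
    n = len(values)
    assert n >= 2 and n & (n - 1) == 0, "n must be a power of 2"
    processors = n // 2      # p(n): processors allocated for the whole run
    rounds = 0               # t(n): number of synchronous pardo steps
    comparisons = 0          # operations actually performed
    current = list(values)
    while len(current) > 1:
        half = len(current) // 2
        # One pardo step: processor i handles the pair (2i, 2i+1), 0-based.
        current = [max(current[2 * i], current[2 * i + 1]) for i in range(half)]
        comparisons += half
        rounds += 1
    work = processors * rounds   # w(n) = p(n) * t(n)
    return current[0], rounds, work, comparisons

if __name__ == "__main__":
    print(parallel_max([3, 9, 2, 7, 5, 1, 8, 4]))
```

For 8 inputs this gives 3 rounds and a work of 12, although only 7 comparisons are executed: most processors sit idle after the first round, which is why the work exceeds the sequential $O(n)$.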

## Optimality

#### Definition

If $w(n)$ is of the same order as the time for the best known sequential algorithm, then the parallel algorithm is said to be optimal.

## Design Principle

Construct optimal algorithms to run as fast as possible.

Equivalently:

Construct optimal algorithms using as many processors as possible!


#### Brent's Scheduling Principle

#### Theorem

If a parallel computation consists of $k$ phases taking times $t_1, t_2, \ldots, t_k$ and using $a_1, a_2, \ldots, a_k$ processors in phases $1, 2, \ldots, k$, then the computation can be done in time $O(a/p + t)$ using $p$ processors, where $t = \sum_{i=1}^{k} t_i$ and $a = \sum_{i=1}^{k} a_i t_i$.
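A short justification of the bound, as a sketch: assume that $p$ processors can emulate one time step of phase $i$ (which uses $a_i$ processors) in $\lceil a_i/p \rceil$ steps. Writing $T_p$ for the resulting running time on $p$ processors (a symbol introduced here, not in the theorem) and summing over the phases:

$$
T_p \;\le\; \sum_{i=1}^{k} \left\lceil \frac{a_i}{p} \right\rceil t_i
\;\le\; \sum_{i=1}^{k} \left( \frac{a_i}{p} + 1 \right) t_i
\;=\; \frac{1}{p} \sum_{i=1}^{k} a_i t_i + \sum_{i=1}^{k} t_i
\;=\; \frac{a}{p} + t .
$$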

#### Previous Example

- Number of phases: $k = \log n$.
- $t_i$ = constant time.
- $a_i = n/2, n/4, \ldots, 1$ processors.
- With $p$ processors we can use time $O(\log n + n/p)$.
- Choose $p = O(n/\log n)$ $\rightarrow$ time $O(\log n)$, and this is optimal!
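The schedule behind these numbers can be counted directly. The sketch below assumes each phase costs one time unit per comparison and that $p$ processors finish a phase with $a_i$ comparisons in $\lceil a_i/p \rceil$ steps; `brent_time_for_max` is a hypothetical helper name.

```python
def brent_time_for_max(n, p):
    """Steps needed by p processors to run the pairwise-max phases on n inputs.

    n must be a power of 2. Phase i performs a_i = n / 2**i comparisons.
    """
    steps = 0
    a_i = n // 2
    while a_i >= 1:
        steps += (a_i + p - 1) // p   # integer ceil(a_i / p)
        a_i //= 2
    return steps

if __name__ == "__main__":
    n = 1024
    for p in (1, n // 10, n // 2):    # n // 10 is roughly n / log n here
        print(p, brent_time_for_max(n, p))
```

With $n = 1024$, a single processor needs 1023 steps, $n/2$ processors need 10, and about $n/\log n \approx 102$ processors need 18, which stays within the bound $a/p + t \approx 20$.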

### **Prefix Computations**

Input: array $A[1..n]$ of numbers.

Output: array $B[1..n]$ such that $B[k] = \sum_{i=1}^{k} A[i]$.

Sequential algorithm:

```
function prefix+(A, n)
    B[1] := A[1]
    for i := 2 to n do
        B[i] := B[i-1] + A[i]
    od
    prefix+ := B
end
```

Time O(n)

#### Parallel Prefix Computation

```
function prefix+(A, n) [p_1, ..., p_n]
    p_1: B[1] := A[1]
    if n > 1 then
        for i := 1 to n/2 pardo
            p_i: C[i] := A[2i-1] + A[2i]
        od
        D := prefix+(C, n/2) [p_1, ..., p_{n/2}]
        for i := 1 to n/2 pardo
            p_i: B[2i] := D[i]
        od
        for i := 2 to n/2 pardo
            p_i: B[2i-1] := D[i-1] + A[2i-1]
        od
    fi
    prefix+ := B
end
```

#### **Prefix Computations**

- The point of this algorithm:
  - It works because + is associative (i.e. the compression works).
  - It will work for any other associative operation.
  - By Brent's scheduling principle:

For any associative operator computable in $O(1)$, its prefix is computable in $O(\log n)$ using $O(n/\log n)$ processors, which is optimal!
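One common way to picture a schedule with fewer processors is to cut the input into blocks. The sketch below is an assumption about how such a schedule might look, not the construction used in the slides: each of `p` hypothetical processors scans its own block, the block totals are combined, and each block is then adjusted. The name `blocked_prefix` is illustrative, the input is assumed non-empty, and step 2 uses a sequential scan for brevity where a parallel prefix over the totals would be used in practice.

```python
from itertools import accumulate
import operator

def blocked_prefix(a, p, op=operator.add):
    """Prefix computation arranged for p (hypothetical) processors."""
    n = len(a)
    size = -(-n // p)                      # block length, ceil(n / p)
    blocks = [a[i:i + size] for i in range(0, n, size)]
    # Step 1 (pardo over blocks): sequential prefix inside each block.
    local = [list(accumulate(blk, op)) for blk in blocks]
    # Step 2: prefix over the block totals (at most p values).
    totals = list(accumulate((blk[-1] for blk in local), op))
    # Step 3 (pardo over blocks): fold the preceding total into each entry.
    result = list(local[0])
    for k in range(1, len(local)):
        result.extend(op(totals[k - 1], x) for x in local[k])
    return result

if __name__ == "__main__":
    print(blocked_prefix(list(range(1, 17)), 4))
```

With $p = n/\log n$ the blocks have length about $\log n$, so steps 1 and 3 take $O(\log n)$ time each and the total number of operations stays $O(n)$.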


#### Merging (of Sorted Arrays)

- Rank function:
  - $rank(x, A, n) = 0$ if $x < A[1]$,
  - $rank(x, A, n) = \max\{i \mid A[i] \le x\}$ otherwise.
  - Computable in time $O(\log n)$ by binary search.
- Merge $A[1..n]$ and $B[1..m]$ into $C[1..n+m]$.
- Sequential algorithm in time $O(n+m)$.

#### Parallel Merge

```
function merge1(A, B, n, m) [p_1, ..., p_{n+m}]
    for i := 1 to n pardo
        p_i: IA[i] := rank(A[i]-1, B, m)
             C[i+IA[i]] := A[i]
    od
    for i := 1 to m pardo
        p_i: IB[i] := rank(B[i], A, n)
             C[i+IB[i]] := B[i]
    od
    merge1 := C
end
```
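The ranking idea can be tried out in Python, where `bisect_right` from the standard library performs the binary search behind `rank`. This is a sequential sketch with 0-based indices; each loop stands for a pardo over the elements, and, as in the pseudocode, the `x - 1` trick for breaking ties assumes integer keys.

```python
from bisect import bisect_right

def rank(x, a):
    """Number of elements of the sorted list a that are <= x (0 if x < a[0])."""
    return bisect_right(a, x)

def merge1(a, b):
    """Merge sorted integer lists a and b by ranking every element."""
    c = [None] * (len(a) + len(b))
    for i, x in enumerate(a):            # pardo over the elements of a
        c[i + rank(x - 1, b)] = x        # x - 1: on ties, elements of a go first
    for i, y in enumerate(b):            # pardo over the elements of b
        c[i + rank(y, a)] = y
    return c

if __name__ == "__main__":
    print(merge1([1, 3, 5], [2, 3, 6]))
```

Every element computes its final position independently of the others, so no two writes target the same cell of `c`, and each position costs one binary search.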

#### Simulating CRCW on EREW

- Assumption: the addressed memory has size $p(n)^c$ for some constant $c$.
- Simulation algorithm idea:
  - Sort accesses.
  - Give priority to the 1st.
  - Broadcast the result for contentious accesses.
- Conclusion: Optimality can be kept with EREW-PRAM when simulating a CRCW algorithm.
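The write half of this idea can be sketched in a few lines. The code below is an illustration under stated assumptions, not the simulation from the slides: one step's write requests are given as `(processor_id, address, value)` tuples, ties are resolved in favour of the lowest processor number, and Python's sequential `sorted` stands in for a parallel sort. The broadcast step for concurrent reads is not shown.

```python
from itertools import groupby

def resolve_concurrent_writes(requests):
    """Pick one winner per address among the writes issued in one step.

    requests: list of (processor_id, address, value).
    Returns {address: value}; the lowest processor_id wins (assumed rule).
    """
    # Sort accesses by address, then by processor number.
    ordered = sorted(requests, key=lambda r: (r[1], r[0]))
    winners = {}
    for address, group in groupby(ordered, key=lambda r: r[1]):
        first = next(group)            # give priority to the first request
        winners[address] = first[2]
    return winners

if __name__ == "__main__":
    step = [(3, 10, "c"), (1, 10, "a"), (2, 20, "b")]
    print(resolve_concurrent_writes(step))
```

After sorting, all requests for one address are adjacent, so each winner can be identified by comparing a request with its left neighbour only, which needs no concurrent access.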

#### Static vs. Dynamic Networks



#### **Bus Based Networks**



#### **Crossbar Networks**



### Multistage Networks



**Figure 2.9** The schematic of a typical multistage interconnection network.

#### Perfect Shuffle Pattern





#### Switches in Omega Networks





Configurations: pass-through and cross-over.

$(p/2) \log p$ switching nodes: $\log p$ stages, each with $p/2$ two-input, two-output switches.

### Omega Network



**Figure 2.12** A complete omega network connecting eight inputs and eight outputs.

#### Blocking in Omega Networks



**Figure 2.13** An example of blocking in omega network: one of the messages (010 to 111 or 110 to 100) is blocked at link AB.
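The shuffle pattern and the blocking example can be reproduced with a small model. The sketch below assumes destination-tag routing: at each stage the label is perfect-shuffled (rotated left by one bit) and the switch then sets the lowest bit to the next destination bit, most significant first, which corresponds to choosing pass-through or cross-over. The function names and the `(stage, label)` naming of links are illustrative.

```python
def shuffle(i, bits):
    """Perfect shuffle: rotate the bits-wide label i left by one position."""
    msb = (i >> (bits - 1)) & 1
    return ((i << 1) & ((1 << bits) - 1)) | msb

def omega_route(src, dst, bits):
    """Return the output link (stage, label) used after each stage."""
    links = []
    pos = src
    for stage in range(bits):
        pos = shuffle(pos, bits)
        dst_bit = (dst >> (bits - 1 - stage)) & 1   # destination bits, MSB first
        pos = (pos & ~1) | dst_bit                  # pass-through or cross-over
        links.append((stage, pos))
    return links

def blocked_links(route_a, route_b):
    """Links that both routes need at the same stage."""
    return sorted(set(route_a) & set(route_b))

if __name__ == "__main__":
    a = omega_route(0b010, 0b111, 3)
    b = omega_route(0b110, 0b100, 3)
    print(a, b, blocked_links(a, b))
```

In this model the two messages of Figure 2.13 (010 to 111 and 110 to 100) both need the same output link of the first stage, so only one of them can proceed.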

# Processors $\leftrightarrow$ Processors Networks



**Figure 2.14** (a) A completely-connected network of eight nodes; (b) a Star connected network of nine nodes.

- Completely connected: high performance, but very expensive.
- Star: the central node is a bottleneck, but the network is cheaper.

#### Linear Arrays and Meshes



**Figure 2.15** Linear arrays: (a) with no wraparound links; (b) with wraparound link.



**Figure 2.16** Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.

#### Hypercubes



**Figure 2.17** Construction of hypercubes from hypercubes of lower dimension.

4-D hypercube


14-02-2006

## Tree Based Networks



**Figure 2.18** Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.

## Fat Trees



**Figure 2.19** A fat tree network of 16 processing nodes.



- All the previous topologies have advantages and disadvantages.
- Important factors: cost and performance.
- Define criteria to characterize cost and performance.

### Criteria

- Diameter: maximum distance (shortest path length) $p_a \leftrightarrow p_b$ over all pairs of processing nodes.
- Connectivity: measure of the multiplicity of paths between processing nodes.
- Bisection width: minimum number of links to cut in order to partition the network into 2 equal halves.
- Bisection bandwidth: minimum volume of communication allowed between the 2 halves.
- Cost: number of communication links, i.e., wires.

**Figure 2.20** Bisection width of a dynamic network is computed by examining various equipartitions of the processing nodes and selecting the minimum number of edges crossing the partition. In this case, each partition yields an edge cut of four. Therefore, the bisection width of this graph is four.

# Comparing The Topologies

**Table 2.1** A summary of the characteristics of various static network topologies connecting p nodes.

| Network                       | Diameter                      | Bisection width | Arc connectivity | Cost (no. of links) |
|-------------------------------|-------------------------------|-----------------|------------------|---------------------|
| Completely-connected          | $1$                           | $p^2/4$         | $p-1$            | $p(p-1)/2$          |
| Star                          | $2$                           | $1$             | $1$              | $p-1$               |
| Complete binary tree          | $2\log((p+1)/2)$              | $1$             | $1$              | $p-1$               |
| Linear array                  | $p-1$                         | $1$             | $1$              | $p-1$               |
| 2-D mesh, no wraparound       | $2(\sqrt{p}-1)$               | $\sqrt{p}$      | $2$              | $2(p-\sqrt{p})$     |
| 2-D wraparound mesh           | $2\lfloor \sqrt{p}/2 \rfloor$ | $2\sqrt{p}$     | $4$              | $2p$                |
| Hypercube                     | $\log p$                      | $p/2$           | $\log p$         | $(p \log p)/2$      |
| Wraparound $k$-ary $d$-cube   | $d\lfloor k/2 \rfloor$        | $2k^{d-1}$      | $2d$             | $dp$                |
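Some of the table entries can be checked on small instances by building the graph and measuring it. The sketch below uses breadth-first search for the diameter and counts undirected links for the cost; the helper names (`diameter`, `link_count`, `hypercube`, `mesh2d`) are illustrative, and the wraparound mesh assumes a side length of at least 3 so that wraparound links are distinct from ordinary ones.

```python
from collections import deque
from itertools import product

def diameter(adj):
    """Longest shortest-path distance in a connected undirected graph."""
    best = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first search from start
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

def link_count(adj):
    """Each undirected link appears in two adjacency sets."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2

def hypercube(d):
    """Nodes 0 .. 2**d - 1; neighbours differ in exactly one bit."""
    return {u: {u ^ (1 << b) for b in range(d)} for u in range(2 ** d)}

def mesh2d(side, wraparound=False):
    """side x side mesh; with wraparound it becomes a 2-D torus."""
    adj = {(r, c): set() for r, c in product(range(side), repeat=2)}
    for r, c in adj:
        for dr, dc in ((0, 1), (1, 0)):
            nr, nc = r + dr, c + dc
            if wraparound:
                nr, nc = nr % side, nc % side
            if (nr, nc) in adj:
                adj[(r, c)].add((nr, nc))
                adj[(nr, nc)].add((r, c))
    return adj

if __name__ == "__main__":
    for name, g in (("hypercube", hypercube(4)),
                    ("mesh", mesh2d(4)),
                    ("torus", mesh2d(4, wraparound=True))):
        print(name, diameter(g), link_count(g))
```

For $p = 16$ the measured values agree with the formulas: the hypercube has diameter $\log p = 4$ and $(p \log p)/2 = 32$ links, the mesh has diameter $2(\sqrt{p}-1) = 6$ and $2(p-\sqrt{p}) = 24$ links, and the wraparound mesh has diameter $2\lfloor\sqrt{p}/2\rfloor = 4$ and $2p = 32$ links.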

## Cache Coherence Protocols

- We need additional hardware to keep multiple copies of the same data consistent with each other.
- We have seen that caches (\$\$) are good, but they do not come for free.
- The mechanism is known as a cache coherence protocol, usually described as a state machine.



**Figure 2.21** Cache coherence in multiprocessor systems: (a) Invalidate protocol; (b) Update protocol for shared variables.



**Figure 2.22** State diagram of a simple three-state coherence protocol.
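Since the state diagram itself is not reproduced in the text, the following is only a generic sketch of what a three-state invalidate protocol can look like as a transition table. The state names (`invalid`, `shared`, `dirty`), the event names and every transition listed are assumptions made for illustration; they are not transcribed from Figure 2.22.

```python
# Assumed transition table for an invalidate protocol with three states.
# "read"/"write" are issued by the local processor; "c_read"/"c_write"
# are actions by other processors, reported by the coherence hardware.
TRANSITIONS = {
    ("invalid", "read"): "shared",
    ("invalid", "write"): "dirty",
    ("shared", "read"): "shared",
    ("shared", "write"): "dirty",
    ("shared", "c_write"): "invalid",
    ("dirty", "read"): "dirty",
    ("dirty", "write"): "dirty",
    ("dirty", "c_read"): "shared",    # the dirty copy is written back first
    ("dirty", "c_write"): "invalid",
}

def next_state(state, event):
    """State of one cache line after an event; unlisted pairs change nothing."""
    return TRANSITIONS.get((state, event), state)

def run(events, state="invalid"):
    """Apply a sequence of events to a cache line starting in `state`."""
    for event in events:
        state = next_state(state, event)
    return state

if __name__ == "__main__":
    print(run(["read", "write", "c_read", "c_write"]))
```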

# Implementations of Cache Coherence Protocols

- There are different ways to implement the protocol described by the state machine.
  - Snoopy cache: good on buses. Each cache has snoopy hardware that monitors bus traffic to keep the states of its lines up to date.
  - Directory based systems: a directory keeps states and presence bits for cache lines.
  - Distributed directory: the directory is physically distributed along with the memory.
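The role of the presence bits can be illustrated with a small sketch of one directory entry. The class name `DirectoryEntry`, the state names and the two operations are hypothetical; the sketch only shows how a write uses the presence bits to find the copies that must be invalidated.

```python
class DirectoryEntry:
    """Directory record for one memory block (hypothetical layout)."""

    def __init__(self, num_processors):
        self.state = "uncached"
        self.presence = [False] * num_processors   # one presence bit per processor

    def read(self, proc):
        """Processor proc obtains a copy of the block."""
        # A dirty copy held elsewhere would be written back before sharing.
        self.presence[proc] = True
        self.state = "shared"

    def write(self, proc):
        """Processor proc obtains exclusive access; returns invalidated processors."""
        invalidated = [q for q, bit in enumerate(self.presence) if bit and q != proc]
        self.presence = [q == proc for q in range(len(self.presence))]
        self.state = "dirty"
        return invalidated

if __name__ == "__main__":
    entry = DirectoryEntry(4)
    entry.read(0)
    entry.read(2)
    print(entry.write(1), entry.presence, entry.state)
```

Only the processors whose presence bit is set receive an invalidation, which is what lets a directory avoid broadcasting every write on a shared bus.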



**Figure 2.24** A simple snoopy bus based cache coherence system.



**Figure 2.25** Architecture of typical directory based systems: (a) a centralized directory; and (b) a distributed directory.